Why I'm building my own dictionary

How it started and how it's going

2022-04-29

A few months ago, I decided to build my very own steno dictionary for the English language from scratch. I wanted one that would be able to replace Plover's very own main.json that came by default. Why? Well there are several reasons, but in short, I'm not a huge fan of the inconsistencies and lack of rules that come with main.json.

This isn't just my own complaint; it's pretty well known in the Plover community that main.json is lacking in these areas. There's even a thread in the Plover Discord all about rewriting the dictionary with automation. The inconsistencies make it hard for beginners and learners, and force users to add entries to their dictionary that many people, including me, deem unnecessary.

You can find all the new rules and differences my dictionary has with main.json over at my GitHub repository. In this article I want to talk more about my motivations and why I decided to take on the momentous task of manually adding over 60,000 entries to a steno dictionary.

Inconsistencies of `main.json`

main.json is Mirabai Knight's personal dictionary and it obviously works really well for her. But of course everyone writes differently. The way we perceive language and phonetics (such as accent and dialect differences) influences what rules work best for us and how we construct our own briefs. (Unfortunately, my dictionary rewrite still depends on having a standard American accent, though there is one person working on an English theory that is not dependent on accents.) That is why stenographers don't learn (or can't learn) one "correct" dictionary/theory. We tailor our dictionaries to match how we think language should be compressed and what rules make sense to us.

That said, tailoring our dictionaries to our own writing style limits their usability to fewer people. Many of the entries in main.json are simply what made sense to Mirabai at the time and did not follow carefully designed systematic rules. They may have followed Plover theory, but as I'll talk about later, the theory itself is still quite ambiguous in certain areas.

[Figure: Even after you learn how to read raw steno... this is quite daunting.]

The goal of my rewrite is to provide rules that make writing out words less uncertain and hairy. I also don't believe in keeping misleading entries in the Plover dictionary and leaving it to the learner to figure out which ones are incorrect.

So what are some examples of inconsistencies within Plover's main default dictionary? Well, here are just a few that I find annoying:


"PHO/TPHOT/TPHOUS": "monotonous",
"PHOE/TPHOT/TPHOUS": "monotonous",
"PHOE/TPHOT/TPHUS": "monotonous",
"PHOPB/OT/TPHOUS": "monotonous",
"PHOPB/OT/TPHUS": "monotonous",
"PHOPB/TOPB/TPHUS": "monotonous",
"PHOPB/TPHOT/TPHOUS": "monotonous",

Which of the above should you use? (If you must know, I'd consider any of the entries using US "wrong" in main.json and would advise against using them.)

Now of course, the whole point of main.json is that it will have all the different ways of breaking up a word into its syllables. That way you don't have to think about how to break up a word. It's a good principle in theory, but it's just not executed well in main.json. (One issue with just including as many translations as possible for a single word is that it confuses the learner if it's known that some entries might be "wrong", i.e., misstrokes.) You might notice that PHOPB/TPHOT/TPHUS is not defined even though it should work.
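To make that gap concrete, here is a minimal sketch in Python that takes the seven entries quoted above and lists the outlines that would exist if OUS and US were used interchangeably. The function names are made up for illustration and are not part of Plover; the only assumption about the dictionary format is that it maps outline to translation, as in the JSON snippet above.

```python
from collections import defaultdict

# The seven main.json entries quoted above (outline -> translation).
ENTRIES = {
    "PHO/TPHOT/TPHOUS": "monotonous",
    "PHOE/TPHOT/TPHOUS": "monotonous",
    "PHOE/TPHOT/TPHUS": "monotonous",
    "PHOPB/OT/TPHOUS": "monotonous",
    "PHOPB/OT/TPHUS": "monotonous",
    "PHOPB/TOPB/TPHUS": "monotonous",
    "PHOPB/TPHOT/TPHOUS": "monotonous",
}


def outlines_by_word(dictionary):
    """Invert outline -> word into word -> sorted list of outlines."""
    grouped = defaultdict(list)
    for outline, word in dictionary.items():
        grouped[word].append(outline)
    return {word: sorted(outlines) for word, outlines in grouped.items()}


def missing_ous_variants(dictionary):
    """Return the outlines needed to make OUS and US fully interchangeable.

    For every outline ending in OUS, the same outline ending in US is
    expected to map to the same word, and vice versa.
    """
    missing = set()
    for outline, word in dictionary.items():
        # Check OUS first, because every outline ending in OUS also ends in US.
        if outline.endswith("OUS"):
            variant = outline[: -len("OUS")] + "US"
        elif outline.endswith("US"):
            variant = outline[: -len("US")] + "OUS"
        else:
            continue
        if dictionary.get(variant) != word:
            missing.add(variant)
    return sorted(missing)


if __name__ == "__main__":
    for outline in missing_ous_variants(ENTRIES):
        print(outline)
```

On these seven entries it reports PHO/TPHOT/TPHUS, PHOPB/TOPB/TPHOUS and PHOPB/TPHOT/TPHUS as missing.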

Although more experienced users of Plover and its dictionary may be able to tell you that main.json prefers OUS over US when a word is spelled with "-ous", this isn't written down in any learning resource. This is, of course, not a hard problem to fix; just write about these hidden rules online and link to them on the Learning Resources page. But the dictionary's inconsistencies make such rules hard to set in stone.

If the dictionary is going to have entries with US and OUS, both of them should be consistently used. There shouldn't be any priority between the two as that will just lead to unnecessary entries and confusion for learners. Many people use the dictionary lookup feature to learn Plover and these kinds of inconsistent entries make it really hard. I also need to mention that this isn't just a problem for "-ous" specifically. This is just one example of an inconsistency within main.json.

Another gripe I have with the above example is that all of these entries rely on the user having to drop unstressed vowels. This is, of course, a good thing as you'll be writing shorter—but sometimes you just don't know which vowels are unstressed. For court reporters and captioners, I suppose this isn't a huge problem as you're always hearing words and translating them to text.

However, for someone like me who likes using steno for typing tests, it's just not ideal. I think my mind works a little more orthographically, which is why I like to include entries like PHO/TPHO/TO/TPHUS. (I use US exclusively in my new dictionary for "-ous" spellings.) I always strive to include as many entries as possible where stress is irrelevant in discerning how to stroke a word.

Believe it or not, there is still more to the example above that bugs me. In Plover's main.json you'll often find entries where consonants are doubled, such as PHOPB/TPHOT/TPHOUS, where the first n is represented twice across two strokes.

This annoys me so much.

It is just so hard to tell whether or not you should double a consonant between two strokes in main.json. This is why I decided to just do away with doubling consonants in my new dictionary. It has led to a little bit of trouble, but I think it's far better than having to decide if a consonant is doubled or not. It's really hard to pin down what the rule is in main.json. I never really figured it out apart from gaining an intuition and a "feel" for it.
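As a rough illustration of what "doubling" means mechanically, the sketch below splits each raw steno stroke into its left-hand consonants, vowels, and right-hand consonants, and flags a pair of neighbouring strokes when the first one ends in the same sound the next one starts with. The two sound tables are a tiny sample chosen for this example, not a full theory, and the check only fires when the whole right-hand cluster is a single known sound. Strokes containing digits are not handled.

```python
import re

# Left-hand consonants, then vowels (plus * or a hyphen), then right-hand
# consonants, following steno order.
STROKE_RE = re.compile(r"^#?([STKPWHR]*)([AO*EU-]*)([FRPBLGTSDZ]*)$")

# A deliberately small sample of chord-to-sound mappings (an assumption for
# this sketch; a real check would need the complete theory).
INITIAL_SOUNDS = {"TPH": "n", "PH": "m", "T": "t", "P": "p", "K": "k", "R": "r"}
FINAL_SOUNDS = {"PB": "n", "PL": "m", "T": "t", "P": "p", "BG": "k", "R": "r"}


def split_stroke(stroke):
    """Return (left consonants, vowels, right consonants) for one stroke."""
    match = STROKE_RE.match(stroke)
    if match is None:
        raise ValueError(f"not a raw steno stroke: {stroke!r}")
    return match.groups()


def doubled_consonants(outline):
    """Return (stroke, next_stroke, sound) for each doubled consonant."""
    strokes = outline.split("/")
    doubled = []
    for current, following in zip(strokes, strokes[1:]):
        _, _, right = split_stroke(current)
        left, _, _ = split_stroke(following)
        final = FINAL_SOUNDS.get(right)
        initial = INITIAL_SOUNDS.get(left)
        if final is not None and final == initial:
            doubled.append((current, following, final))
    return doubled


if __name__ == "__main__":
    print(doubled_consonants("PHOPB/TPHOT/TPHOUS"))
    print(doubled_consonants("PHO/TPHOT/TPHOUS"))
```

The first call flags the n shared by PHOPB and TPHOT; the second outline has no doubling, so it returns an empty list.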

Finally, the last major problem I have with main.json that's depicted in the example above is how sometimes strokes in the middle of an outline can start with a vowel. This always seemed kind of unnatural to me. I'd rather break up words in a way that every stroke begins with a consonant. I think that makes the most sense which is what I've implemented in my dictionary.

I've sometimes seen this called "banana splitting" because it makes the most sense on a word like "banana". ("Left hand greedy" is another name I've heard, which is probably more appropriate.) You wouldn't pronounce it as "ban-a-na" or "ban-an-a", so why should you write it that way? The most sensible outline to write it would be PWA/TPHA/TPHA, "ba-na-na".
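The rule is simple enough to check mechanically. This is a small sketch, assuming outlines are written in raw steno with strokes separated by slashes; the function name is hypothetical. It returns every stroke after the first that does not begin with a left-hand consonant, which also means suffix strokes that start on the right hand or with an asterisk get flagged.

```python
import re

# A stroke "begins with a consonant" here if it starts with at least one
# left-hand key (optionally preceded by the number bar).
LEFT_CONSONANTS = re.compile(r"^#?[STKPWHR]+")


def vowel_initial_strokes(outline):
    """Return the non-initial strokes that lack a left-hand consonant."""
    strokes = outline.split("/")
    return [stroke for stroke in strokes[1:] if not LEFT_CONSONANTS.match(stroke)]


if __name__ == "__main__":
    for outline in ("PHOPB/OT/TPHOUS", "PWA/TPHA/TPHA"):
        print(outline, vowel_initial_strokes(outline))
```

PHOPB/OT/TPHOUS is flagged because of OT, while PWA/TPHA/TPHA passes cleanly.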

These are all the inconsistencies that bug me the most in main.json. Although I've only given an example of one word, you can easily find many words that have the same inconsistencies I've listed already. In fact, I only had to look through the dictionary for 2 minutes to find "monotonous". I think that just goes to show how consistently inconsistent the dictionary is.

[Figure: The joys of main.json.]

Other minor considerations

There aren't that many other things that bug me about main.json which I haven't listed above, but I figured I should list them for completeness' sake:

Very technical medical jargon that I'd never use
Misstrokes that are confusing to beginners
No easy way of disambiguating between names and words (e.g., Mark and mark)
I really don't like the traditional number bar system

How my dictionary differs

[Figure: One Discord user illustrating how annoying misstrokes can be.]

I wrote down a few rules on how to break up syllables so that there are usually only a few ways to do so. Because of slight accent differences, many words still have multiple ways of writing them (e.g., pronouncing "economic" with a long initial E or a short initial E). But syllable breaks are a lot more consistent. And as I've mentioned, my dictionary has quite a few fully written out entries where unstressed vowels are not dropped. This eliminates the issue of having to figure out stress before writing a word. I've done away with doubling consonants, and I've also used KWR as an initial silent consonant for joining syllables together, which is helpful in suffixes as well as when it's hard for a stroke to begin with a consonant. Again, the nitty-gritty details are over at the repository description.

Now of course, the dictionary is not without its flaws. Having been spoiled by main.json, I've had to add a few exceptions to the rules so that my writing doesn't have to change too much to adapt to the new dictionary. It's amazing how you can learn its inconsistencies. And as with all dictionaries, there will always be inconsistencies, especially due to word boundary issues (e.g., KOPL/PAT being "come pat" or "compat"). (I ended up making KPAT the only way to write "compat".)
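A word boundary issue like this can be found by asking, for each multi-stroke outline, whether it can also be read as two shorter entries written back to back. The sketch below does exactly that. The sample dictionary is hypothetical and built only from the example above, with KOPL/PAT included as an entry purely to show the conflict.

```python
def boundary_conflicts(dictionary):
    """Find multi-stroke outlines that also read as two shorter entries.

    Returns a dict mapping each such outline to a list of
    (first_translation, second_translation) pairs, one per split point.
    """
    conflicts = {}
    for outline in dictionary:
        strokes = outline.split("/")
        splits = []
        for i in range(1, len(strokes)):
            head = "/".join(strokes[:i])
            tail = "/".join(strokes[i:])
            if head in dictionary and tail in dictionary:
                splits.append((dictionary[head], dictionary[tail]))
        if splits:
            conflicts[outline] = splits
    return conflicts


# Hypothetical sample built from the example in the text.
SAMPLE = {
    "KOPL": "come",
    "PAT": "pat",
    "KOPL/PAT": "compat",
    "KPAT": "compat",
}

if __name__ == "__main__":
    print(boundary_conflicts(SAMPLE))
```

Here KOPL/PAT is reported as clashing with "come" followed by "pat", while KPAT has no such conflict.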

I've also added some exceptions due to how much Plover has to backspace with multistroke translations. For example, if I had PWU translate as "about you", and PWU/TER translate as "butter", you would first see "about you" after writing PWU, followed by it all being backspaced by Plover after writing TER, as "butter" is finally output. I don't like this behavior, so I've tried my best to limit it by instead making PWUT/*ER the valid outline. (Unfortunately, there's not much of a way around this in Plover, and writing translations with 3 or more strokes will always be a little bit hairy.)
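One way to spot outlines that cause this kind of backspacing is to look for multi-stroke entries whose leading strokes are themselves an entry. This is a minimal sketch of that idea and not how Plover itself works internally; the function name and the three-entry sample are assumptions taken from the example above.

```python
def backspace_triggers(dictionary):
    """List (prefix, outline) pairs where a prefix of the outline is also defined.

    Writing such an outline shows the prefix's translation first and then
    replaces it once the remaining strokes arrive.
    """
    triggers = []
    for outline in sorted(dictionary):
        strokes = outline.split("/")
        for i in range(1, len(strokes)):
            prefix = "/".join(strokes[:i])
            if prefix in dictionary:
                triggers.append((prefix, outline))
    return triggers


# Hypothetical sample containing only the entries mentioned in the text.
SAMPLE = {
    "PWU": "about you",
    "PWU/TER": "butter",
    "PWUT/*ER": "butter",
}

if __name__ == "__main__":
    for prefix, outline in backspace_triggers(SAMPLE):
        print(f"{outline}: shows {SAMPLE[prefix]!r} before {SAMPLE[outline]!r}")
```

In this sample only PWU/TER is reported. PWUT/*ER passes because PWUT is not defined here, though a full dictionary would likely define it.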

The main problem I'm having with this dictionary rewrite is that "banana splitting" isn't actually all that natural to me for some words. I'm not sure if it's due to one and a half years of main.json or just that it's not always ideal. Any word containing an R after a vowel is always tricky because the R makes the vowel slightly different.

Take, for example, "carrot". (You could, of course, one-stroke this with KAEURT, but that's beside the point right now.) How would one write this with "banana splitting"? It would have to be KA/ROT or KAEU/ROT. Notice how both examples are quite unnatural. The vowel sound in "carrot" is slightly different from a long A (technically a diphthong, but most North Americans know it as a "long vowel"), but it's also different from just a short A, so you can't really break it between the A and the R. The R is really part of the vowel. What you'd want is something more like KAEUR/KWROT or KAR/KWROT with the KWR acting as a linker.

I'm really not sure how to proceed with this problem. It's not really gone well, to be honest, and is the biggest inconsistency with my dictionary. I'm already 60k entries in, so there's not much I think I can really do about it. Maybe after the fact I'll think of something. But for now, I've reverted to trying to include the "banana splitting" method as well as using a linker to keep the vowel and R together.

My current thinking is to treat the vowel like a short vowel, and simply break it in between the vowel and the R regardless of how the vowel is pronounced. (KA/ROT would be the only valid way of writing this out.) This is how I deal with them most of the time. I also add alternative entries where it's iffy and using a linker chord makes more sense and feels more natural.

I think a better solution could be to add a rule where the R should be treated as a vowel in these cases and instead always revert to using the KWR linker in order to keep the vowel and R together. (KAEUR/KWROT and KAR/KWROT would be valid entries.)

My current advice for words such as these is to simply not think about them too much and brief them instead. This has always been my main approach to inconsistencies which is why I personally have gotten along quite well with main.json.

The grind

What is it like to build a dictionary‽

There's really not much to it; just add entries from a word list. At first I used a list containing the top 3000 most common English words before alphabetically going through main.json as a word list. I started this back in early January, adding 1000 entries a day, and made it to L in mid-March.

It was at this point that my Splitography unfortunately broke. To make matters worse, school also got in the way of my routine. But I had already been using the dictionary without main.json two weeks after I started this grind. In its current state, it's actually fairly usable. In my daily use, I really only have to add about five entries every day on average.

I've been meaning to get back to finishing this dictionary, but I have not had the time. With summer break just around the corner, however, I do expect I will eventually finish this project.

Conclusion

Even though this entire journey has been about building my own dictionary, I've actually learned quite a bit about main.json and have even taken inspiration from it for certain ideas. I've really come to understand word boundary issues a lot more and have become more conscious about adding entries that might have an impact in this area.

Dictionary building is quite boring, but if you spend maybe an hour or two every day, you could get a pretty decent-sized dictionary within a month (especially if you start with a better word list than main.json). I'd totally recommend it if you are interested in that sort of thing. I think it is the best way to familiarize yourself with your own style of writing. I can totally understand why steno students are usually only given a small starter dictionary and have to get to a decent size themselves.

Of course, it's a huge effort to build your own dictionary and I really don't want to make it seem like you have to spend hours and hours building your own dictionary just to be able to use steno. I got fairly fast with main.json long before I even dreamt of creating my own dictionary. It's just that the people who have had a similar experience aren't all that common.

The number of people who discover steno and wish to learn it, only to be jaded by the inconsistencies they have to learn in main.json, is quite disheartening. Hopefully when my dictionary is completed (or the proposed automated main.json rewrite on Discord goes through) fewer people will be turned away by these inconsistencies. (If you're interested in using my dictionary in its current state, be prepared to fingerspell and add entries a lot.)