wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks

Common contractions are missing from frequency_dictionary_en_82_765.txt #109

Open rogerbock opened 3 years ago

rogerbock commented 3 years ago

It looks like contractions were added as an afterthought to this list after it was constructed, as they appear at the end and have artificial counts:

can't 300000
won't 300000
don't 300000
couldn't 300000
shouldn't 300000
wouldn't 300000
needn't 300000
mustn't 300000
she'll 300000
we'll 300000
he'll 300000
they'll 300000
i'll 300000
i'm 300000

There are some very common contractions missing from this list, such as "didn't". This means that when I try to correct a phrase like "I didnt want that", I get the suggestion "I didst want that", which is not ideal.

Is this a known issue? Is there a better frequency dictionary to use that includes contractions? Or should I just add more entries with artificial counts? Thank you!

wolfgarbe commented 3 years ago

The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned below. Through this reciprocal filtering, only words that appear in both lists are used. Additional filters were applied, and the resulting list was truncated to the ≈ 80,000 most frequent words.

Google Books Ngram data : Provides representative word frequencies http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

SCOWL - Spell Checker Oriented Word Lists (License) : Ensures genuine English vocabulary http://wordlist.aspell.net/

The Google Ngram data does not contain contractions, so they were also missing from the resulting dictionary and were manually added afterwards with artificial counts. Missing contractions can be added.

For example, the Books Ngram Viewer replaces "didn't" with "did not" to match how the books were processed. https://books.google.com/ngrams/graph?content=didn%27t&year_start=1800&year_end=2019

You can look up the frequency of "did not" in the frequency_bigramdictionary_en_243_342.txt included in SymSpell and use this frequency for "didn't". That should be a better approximation than the artificial counts.
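A minimal sketch of that lookup in Python, assuming each line of the bigram dictionary has the form `word1 word2 count` separated by whitespace. The helper names `load_bigram_counts` and `contraction_count` are hypothetical, and the sample counts are made up for illustration; real code would iterate over the lines of the opened dictionary file instead.

```python
from typing import Dict, Iterable, Tuple


def load_bigram_counts(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Parse lines of the assumed form 'word1 word2 count'."""
    counts: Dict[Tuple[str, str], int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip blank or malformed lines
        counts[(parts[0], parts[1])] = int(parts[2])
    return counts


def contraction_count(expansion: str,
                      bigrams: Dict[Tuple[str, str], int],
                      fallback: int = 300000) -> int:
    """Return the bigram count of the expanded form, else the artificial count."""
    first, second = expansion.split()
    return bigrams.get((first, second), fallback)


# Made-up sample lines, NOT real counts from the dictionary.
sample = ["did not 1000", "does not 500"]
bigrams = load_bigram_counts(sample)

# Emit a line in the 'term count' shape used by the contraction list above.
print("didn't", contraction_count("did not", bigrams))
```

If the expanded form is not present in the bigram dictionary, the sketch falls back to the artificial count of 300000 used for the existing contractions.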

rogerbock commented 3 years ago

Thank you for this information! Do you know if there are any other transformations applied to the data used to create the dictionary besides "didn't" -> "did not"? Is there a list somewhere?

wolfgarbe commented 3 years ago

Google: "How does the Ngram Viewer handle punctuation? We apply a set of tokenization rules specific to the particular language. In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not. In Russian, the diacritic ё is normalized to e, and so on. The same rules are applied to parse both the ngrams typed by users and the ngrams extracted from the corpora, which means that if you're searching for don't, don't be alarmed by the fact that the Ngram Viewer rewrites it to do not; it is accurately depicting usages of both don't and do not in the corpus. However, this means there is no way to search explicitly for the specific forms can't (or cannot): you get can't and can not and cannot all at once." https://books.google.com/ngrams/info

I guess this transformation is applied by Google to all English contractions to generate the ngram data from the corpus. The Google ngram data was then used to generate the SymSpell dictionaries (single term dictionary and bigram dictionary). https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions (List of popular English contractions)
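The rules quoted above can be approximated with a short Python sketch that maps a contraction to the two-token form one would search for. This is an illustration of the quoted description only, not Google's tokenizer: the function name `ngram_form`, the suffix set, and the irregular mappings (for example "won't" to "will not") are assumptions.

```python
# Irregular negations whose stem changes; these mappings are assumptions.
IRREGULAR_NEGATIONS = {
    "can't": "can not",
    "won't": "will not",
    "shan't": "shall not",
}

# Suffixes that the quoted rules describe as being split off.
SPLIT_SUFFIXES = {"ll", "re", "ve", "d", "s", "m"}


def ngram_form(token: str) -> str:
    """Return the approximate tokenized form of an English contraction."""
    token = token.lower()
    if token in IRREGULAR_NEGATIONS:
        return IRREGULAR_NEGATIONS[token]
    if token.endswith("n't"):
        # Negations are normalized: "don't" becomes "do not".
        return token[:-3] + " not"
    head, sep, tail = token.rpartition("'")
    if sep and head and tail in SPLIT_SUFFIXES:
        # Other contractions are split: "they're" becomes "they 're".
        return f"{head} '{tail}"
    return token  # anything else stays a single token


for word in ["don't", "didn't", "they're", "we'll", "R'n'B"]:
    print(word, "->", ngram_form(word))
```

Only the negated forms yield two ordinary words, so those are the ones most likely to be found as bigrams; whether split forms such as "they 're" survive in the SymSpell bigram dictionary is not confirmed here.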

rogerbock commented 3 years ago

I added in the following contractions:

aren't could've didn't doesn't hadn't hasn't haven't he'd he's here's how'd how'll how're i'd i've isn't it'd it'll it's let's might've o'clock she'd she's should've somebody's someone's something's that's there's they'd they're they've wasn't we'd we're we've weren't what's where's who'd who'll who're who's why'd why're why's you'd you'll you're you've

I also added in the following common words that I noticed were not in the dictionary:

covid hi

I think this workaround is sufficient for my purposes, but I'll let you decide if you want to keep this issue open or not. Thank you for your help!
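One way to apply such additions without creating duplicate entries, sketched in Python. It assumes the frequency dictionary uses one `term count` pair per line, as in the contraction list at the top of this issue; the helper name `merge_entries` is hypothetical, and 300000 is the artificial count already used for the existing contractions.

```python
from typing import List


def merge_entries(dictionary_lines: List[str],
                  extra_terms: List[str],
                  default_count: int = 300000) -> List[str]:
    """Append 'term count' lines for terms not already in the dictionary."""
    # Keep non-empty lines, without trailing newlines.
    merged = [line.rstrip("\n") for line in dictionary_lines if line.strip()]
    # The first whitespace-separated field of each line is the term.
    existing = {line.split()[0] for line in merged}
    for term in extra_terms:
        key = term.lower()
        if key not in existing:
            merged.append(f"{key} {default_count}")
            existing.add(key)
    return merged


# Tiny stand-in for the real file contents; the count for "the" is made up.
current = ["the 100", "can't 300000"]
for line in merge_entries(current, ["didn't", "can't", "covid"]):
    print(line)
```

Terms that already exist, such as "can't" here, keep their current count and are not added a second time.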

ghost commented 2 years ago

Attached are numerous contractions, some of which do not include apostrophes (such as gonna and gimme).

contractions.txt

See: https://en.wiktionary.org/wiki/Category:English_contractions

wolfgarbe commented 2 years ago

Thank you. I'm sorry for the delay, it's still on my to-do list ...