wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

Use bigram for spell checking #110

Open ierezell opened 3 years ago

ierezell commented 3 years ago

Hi, first of all thanks for this very nice piece of software !

I'm using the symspellpy port and it's working perfectly.

However, in some cases (in French, for example) I have chat messages like "randé vs" instead of "rendez vous", or even "je suit" instead of "je suis".

The latter is always in my bigram dictionary, and the former is not.

So I was thinking about checking against all bigrams, to get better spell checking than with single words from the unigram list alone. I was wondering whether similar behaviour already exists in SymSpell, or whether it is planned.

Thanks again, Have a wonderful day

wolfgarbe commented 3 years ago

SymSpell.LookupCompound should do exactly this. It uses the optional bigram dictionary (load with symSpell.LoadBigramDictionary) in order to use sentence level context information for selecting the best spelling correction for multiple input terms. But I haven't tested it for French.
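To illustrate the idea under discussion (letting bigram counts decide between candidate corrections), here is a toy Python sketch. It is not SymSpell's LookupCompound implementation: `edits1`, `candidates`, `correct_pair`, and all counts are made up for illustration, and only candidates within one edit are considered.

```python
from itertools import product

# Letters used to generate candidate edits (lowercase French, illustrative).
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"

def edits1(word):
    """All strings within one edit (delete, transpose, replace, insert) of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, unigrams):
    """The word itself plus every known word one edit away."""
    return {word} | {w for w in edits1(word) if w in unigrams}

def correct_pair(first, second, unigrams, bigrams):
    """Pick the candidate pair with the highest bigram count; fall back to the input."""
    best, best_count = (first, second), bigrams.get(f"{first} {second}", 0)
    for a, b in product(candidates(first, unigrams), candidates(second, unigrams)):
        count = bigrams.get(f"{a} {b}", 0)
        if count > best_count:
            best, best_count = (a, b), count
    return best

# Toy counts, invented for illustration only.
unigrams = {"je": 1000, "peut": 400, "peux": 300, "suis": 500, "suit": 100,
            "parce": 350, "perce": 20, "que": 900}
bigrams = {"je peux": 120, "je suis": 200, "parce que": 300}

print(correct_pair("je", "peut", unigrams, bigrams))
print(correct_pair("perce", "que", unigrams, bigrams))
```

With these toy counts, a pair of valid unigrams that never occurs together is replaced by a nearby pair that does, which is the behaviour asked for in this thread.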

ierezell commented 3 years ago

Hi @wolfgarbe, thanks a lot for the fast answer !

I did exactly that (symSpell.LoadBigramDictionary) with symspellpy (maybe the implementation differs?). I created my own bigram dictionary from the Google n-grams (by the way, I can share the Python code if needed).

However, some chatbot sentences (really badly written) are not corrected properly.

Here is an example; I hope it helps.

je peut pas recevoir mes 3 enfants avec leurs enfants cecqui fait 3 bukbes perce wue ils sont plus que 8 pas logique ni justeo
        |                                               |             |       |   |
je peut pas recevoir mes 3 enfants avec leurs enfants ce qui fait 3 bulbes perce que ils sont plus que 8 pas logique ni juste
        |                                               |             |       |   |
        |                                               |             |   "perce" exists, "que" exists but "perce que" is 
        |                                               |             |   not a bigram in the dict it should be "parce que"
        |                                               |             |
        |                                               |       Not the right word, but that's OK; I will check with custom logic
        |                                          Perfect
      "je" and "peut" are valid unigrams but "je peut" is not a bigram, it should be "je peux" which is in the bigrams.   

Thanks again for your time,

Have a great day

wolfgarbe commented 3 years ago

If you attach the French frequency dictionary and the bigram dictionary files to the issue in plain text format, I could have a look at what goes wrong (in SymSpell and/or the port).

ierezell commented 3 years ago

Here are the bigram and unigram dictionaries, obtained with the script below: bigram.txt unigram.txt

Note that the script's extension is .txt because GitHub doesn't allow posting .py files. I took only the most recent count for each unigram or bigram. I also kept only the 80,000 most frequent unigrams and the 160,000 most frequent bigrams.
google_ngrams.txt
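
The attached script is not reproduced here. As a rough sketch of the steps described (keep the most recent count per n-gram, then keep the top N), something like the following could work. It assumes tab-separated rows of the form `ngram<TAB>year<TAB>match_count<TAB>volume_count`, which may not match the dataset version actually used, and it assumes the space-separated `term count` layout for the output. The function names are hypothetical.

```python
import heapq

def latest_counts(lines):
    """Map each n-gram to the match count of its most recent year."""
    latest = {}  # ngram -> (year, count)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed rows
        ngram, year, count = parts[0], int(parts[1]), int(parts[2])
        if ngram not in latest or year > latest[ngram][0]:
            latest[ngram] = (year, count)
    return {ngram: count for ngram, (_, count) in latest.items()}

def top_n(counts, n):
    """The n most frequent entries as (ngram, count), most frequent first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

def to_dictionary_lines(entries):
    """Format entries as space-separated 'term count' lines (assumed layout)."""
    return [f"{ngram} {count}" for ngram, count in entries]

# Invented rows, for illustration only.
rows = [
    "je suis\t2005\t10\t3",
    "je suis\t2008\t25\t5",
    "je peux\t2008\t7\t2",
]
print(to_dictionary_lines(top_n(latest_counts(rows), 160_000)))
```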

Thanks again a lot for your help !

ierezell commented 3 years ago

Sorry to bump again, but I played with it more to make it work like the English version.

I tried putting spaces in random places and realized that my first version of the bigrams and unigrams was too loose, so I made it stricter: no bigrams of space + word or word + space, and only words of at least one character that appear in a French dictionary.
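
The stricter filter described above could look roughly like this. It is a sketch under assumptions: `is_clean_ngram` and `filter_counts` are hypothetical names, and `vocabulary` stands in for whatever French word list is used.

```python
def is_clean_ngram(ngram, vocabulary):
    """True if every space-separated token is non-empty and a known word."""
    tokens = ngram.split(" ")
    # A leading or trailing space yields an empty token, which is rejected.
    return all(token and token in vocabulary for token in tokens)

def filter_counts(counts, vocabulary):
    """Keep only the n-grams that pass is_clean_ngram."""
    return {ngram: count for ngram, count in counts.items()
            if is_clean_ngram(ngram, vocabulary)}

# Invented data, for illustration only.
vocabulary = {"je", "suis", "peux"}
counts = {"je suis": 25, " suis": 4, "je ": 3, "je xyz": 2}
print(filter_counts(counts, vocabulary))
```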

Even with that, I cannot get the example above to work, but it fixed most of the random-space and random-splitting errors.

Once everything is fixed, I will replicate this for the other languages in the Google n-grams and give you the files, so that this framework can support more languages out of the box.

Have a great week

wolfgarbe commented 3 years ago

> Once everything is fixed, I will replicate this for the other languages in the Google n-grams and give you the files, so that this framework can support more languages out of the box.

That's great.

I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

ierezell commented 3 years ago

Hi @wolfgarbe,

> I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

Don't worry, and thanks a lot for your time and dedication; it's really appreciated!

I will think a bit more about how to collect and clean data for other languages, such as Russian, which has one-character words, or Chinese...

Here is another related sentence:

"When will she arrive": "quand va-t-elle arriver" was written "quand va telleariver" and corrected to "quand va telle river". My problem is that the word "arriver" is more frequent than "river", so I thought it would be chosen. Also, "elle arriver" is a bigram and "telle river" is not, so correcting to "quand va elle arriver" would be perfect.

Otherwise it's harder to recover the real sentence (with a language model, for example).

So far SymSpell has been the most complete spell checker for my needs, but I may add a phonetic or POS layer (chatbot text is really awfully spelled). Do you plan to add this kind of improvement?

Have a great day

wolfgarbe commented 3 years ago

> I may add a phonetic or POS layer. Do you plan to add this kind of improvement?

Implementing a weighted edit distance giving a higher rank to character pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound) would certainly be a good improvement. But I don't think that I will find time to implement this near-term.

But there are at least two SymSpell ports that have already implemented a weighted edit distance:

https://github.com/MighTguY/customized-symspell
https://github.com/searchhub/preDict
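
For readers unfamiliar with the idea, a keyboard-weighted edit distance can be sketched as a Levenshtein distance in which substituting neighbouring keys costs less than a full edit. The sketch below is not taken from SymSpell or from either port above. `weighted_distance`, the `near_cost` value, and the small `NEIGHBOURS` map (a hand-written AZERTY excerpt) are all assumptions for illustration.

```python
def weighted_distance(source, target, neighbours, near_cost=0.5):
    """Levenshtein distance where substituting neighbouring keys costs `near_cost`."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [float(i)]
        for j, t in enumerate(target, start=1):
            if s == t:
                substitution = 0.0
            elif t in neighbours.get(s, ""):
                substitution = near_cost
            else:
                substitution = 1.0
            current.append(min(previous[j] + 1.0,        # delete s
                               current[j - 1] + 1.0,     # insert t
                               previous[j - 1] + substitution))
        previous = current
    return previous[-1]

# Hypothetical, incomplete excerpt of AZERTY key adjacency.
NEIGHBOURS = {"w": "qsx", "q": "azsw", "k": "jl", "l": "km"}

print(weighted_distance("wue", "que", NEIGHBOURS))
print(weighted_distance("wue", "rue", NEIGHBOURS))
```

Under these assumptions, "que" ends up closer to "wue" than "rue" does, because w and q are neighbours on the keyboard while w and r are not. A phonetic variant would use a similar cost table keyed on similar-sounding characters.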

ierezell commented 3 years ago

Hi @wolfgarbe, sorry to bump this thread again... Do you have any news about the possible improvements to the bigram corrections?

I would be glad to help if some contribution is possible or needed.

Have a great day

wolfgarbe commented 3 years ago

Unfortunately, I have not yet found the time, but it is still on my mind.

ierezell commented 3 years ago

Hi @wolfgarbe, I'm really sorry to bump again... I'm sure you have a lot on your plate, so could you point me to the right place in the code so I can debug this and open a PR with a fix?

It's becoming urgent for me, so I will put some hours into it :)

Thanks in advance and thanks again for this great library ! Have a great day

ierezell commented 2 years ago

Hello @wolfgarbe, as I also raised the issue on symspellpy, we may have found where it comes from, and there could be a fix.

https://github.com/mammothb/symspellpy/issues/107