wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

Command line lookupcompound & wordsegment with MaxEditDistance=0 does not perform any segmentation. #42

Closed rockroland closed 6 years ago

rockroland commented 6 years ago

With this setting, only the original word is returned in the output. With MaxEditDistance=1 the input is segmented, but the results are corrected terms at an edit distance of 1.

rockroland commented 6 years ago

can you confirm this?

wolfgarbe commented 6 years ago

No, I can't confirm this. The command line does segment with MaxEditDistance=0, but only if the words are already valid and spelled correctly and just the space between them is missing. Perhaps you were expecting that with MaxEditDistance=0 it would still segment misspelled words, just leaving the error in. That is impossible: to find the word boundaries we first need to find the correct word, and to find a correct word for an incorrect input we need MaxEditDistance>0.

MaxEditDistance=0:

dotnet SymSpell.CommandLine.dll load frequency_dictionary_en_82_765.txt lookupcompound 0 < lookupcompound_input.txt > lookupcompound_output.txt

lookupcompound_input.txt:
expensivehouse
expensivahouse

lookupcompound_output.txt:
expensive house
expensivahouse

MaxEditDistance=1:

dotnet SymSpell.CommandLine.dll load frequency_dictionary_en_82_765.txt lookupcompound 1 < lookupcompound_input.txt > lookupcompound_output.txt

lookupcompound_input.txt:
expensivehouse
expensivahouse

lookupcompound_output.txt:
expensive house
expensive house
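To illustrate why a misspelled compound cannot be split at edit distance 0, here is a minimal Python sketch. It is not SymSpell's implementation: `segment_exact` and `toy_dictionary` are hypothetical names, and the function is a toy segmenter that accepts exact dictionary matches only, which roughly mimics what MaxEditDistance=0 allows.

```python
def segment_exact(text, dictionary):
    """Split text into dictionary words using exact matches only.

    Returns a list of words, or None if no exact split exists.
    Toy illustration only; this is NOT how SymSpell segments internally.
    """
    # best[i] holds one list of words covering text[:i], or None if none exists
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in dictionary:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


# Hypothetical two-word dictionary, just enough for the example input
toy_dictionary = {"expensive", "house"}

for term in ("expensivehouse", "expensivahouse"):
    parts = segment_exact(term, toy_dictionary)
    # Fall back to the unchanged input when no exact split is possible
    print(" ".join(parts) if parts else term)
```

With exact matching, "expensivehouse" splits into two known words, while "expensivahouse" comes back unchanged because "expensiva" matches nothing in the dictionary, so no word boundary can be located.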

rockroland commented 6 years ago

For some reason it does not work with my dictionary of names and frequencies. I checked whether the compound names in my input file were listed as-is in the dictionary, in which case they would be treated as correctly spelled and no split would be needed. However, they were not in the dictionary. Examples are "christinamaria" and "jamesmonroe". These compound names were split properly with your default dictionary, but with my dictionary no compound name I try is split.

This is my dictionary txt file (zipped): https://mega.nz/#!vBEC1YYS!oL0C9ZVLkvNHw8hoxzr9AO8-aX-5ggbG_KGKBH9wwy0

It contains ~4M names, each followed by a space and an integer frequency. Can you confirm the issue or recommend a way forward?

wolfgarbe commented 6 years ago

I think the problem is that the terms in your dictionary start with an upper-case letter.

But SymSpell expects both the dictionary terms and the input terms to be in lower case: https://github.com/wolfgarbe/SymSpell#dictionary-file-format

If you convert your dictionary to lower-case it should work.
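One possible way to do the conversion is sketched below in Python. It assumes each line has the form "term count" separated by whitespace, as described for this dictionary. The function names and file paths are hypothetical, and summing the counts of entries that become identical after lower-casing (for example "Maria" and "maria") is an assumption of this sketch, not something SymSpell requires.

```python
from collections import Counter


def lowercase_dictionary(lines):
    """Lower-case every 'term count' line and sum counts of colliding terms."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip lines that are not exactly "<term> <non-negative integer>"
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        term, count = parts
        counts[term.lower()] += int(count)
    # Counter keeps first-insertion order, so the output order follows the input
    return [f"{term} {count}" for term, count in counts.items()]


def convert_file(src_path, dst_path):
    """Read a dictionary file, lower-case it, and write the result (paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src:
        converted = lowercase_dictionary(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(converted) + "\n")


# Small in-memory example with made-up frequencies
sample = ["Christina 120", "Maria 300", "maria 5"]
print(lowercase_dictionary(sample))
```

The input terms need the same treatment, for example by lower-casing each line of the input file before passing it to the command line.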

rockroland commented 6 years ago

thanks!