Closed by rockroland 6 years ago
Can you confirm this?
No, I can't confirm this. The command line does segment with MaxEditDistance=0, but only if the words are already valid, correctly spelled words and just the space between them is missing. Perhaps you were expecting that with MaxEditDistance=0 it would still segment misspelled words, just leaving the error in. That is impossible: to find the word boundaries we first need to find the correct word, and to find a correct word for an incorrect input we need MaxEditDistance>0.
MaxEditDistance=0
dotnet SymSpell.CommandLine.dll load frequency_dictionary_en_82_765.txt lookupcompound 0 < lookupcompound_input.txt > lookupcompound_output.txt
lookupcompound_input.txt: expensivehouse expensivahouse
lookupcompound_output.txt: expensive house expensivahouse
MaxEditDistance=1
dotnet SymSpell.CommandLine.dll load frequency_dictionary_en_82_765.txt lookupcompound 1 < lookupcompound_input.txt > lookupcompound_output.txt
lookupcompound_input.txt: expensivehouse expensivahouse
lookupcompound_output.txt: expensive house expensive house
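The behaviour with MaxEditDistance=0 can be illustrated with a toy segmenter that only accepts exact dictionary matches. This is a minimal sketch of the general idea, not SymSpell's actual algorithm; the function name `segment_exact` and the two-word vocabulary are made up for the illustration.

```python
def segment_exact(text, vocabulary):
    """Split text into vocabulary words using exact matches only.

    Returns a list of words, or None if no exact split exists.
    """
    # best[i] holds one list of words covering text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


# Hypothetical two-word vocabulary, just for this example.
vocabulary = {"expensive", "house"}

print(segment_exact("expensivehouse", vocabulary))  # ['expensive', 'house']
print(segment_exact("expensivahouse", vocabulary))  # None
```

Without any allowed edits, "expensiva" never matches a dictionary entry, so no word boundary can be placed after it and the input has to be returned unsplit.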
For some reason it does not work with my dictionary of names+frequencies. I checked whether the compound names in my input file were listed "as-is" in my dictionary, in which case they would be assumed to be spelled correctly and no split would be necessary. However, they are not in the dictionary. Examples are "christinamaria" and "jamesmonroe". These compound names are split properly with your default dictionary, but with my dictionary no compound name I try gets split. This is my dictionary txt file (zipped): https://mega.nz/#!vBEC1YYS!oL0C9ZVLkvNHw8hoxzr9AO8-aX-5ggbG_KGKBH9wwy0 It contains ~4M names, each followed by a space and its integer frequency. Can you confirm the issue or recommend a way forward?
I think the problem is that the terms in your dictionary start with an upper-case letter.
But SymSpell expects both the dictionary terms and the input terms to be in lower case. https://github.com/wolfgarbe/SymSpell#dictionary-file-format
If you convert your dictionary to lower-case it should work.
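One way to do that conversion is sketched below, assuming each line of the dictionary is a term followed by whitespace and an integer frequency, as described above. The function names and the file names in the usage note are hypothetical. Summing the counts of terms that collide after lower-casing (e.g. "Maria" and "maria") is a design choice of this sketch, not something SymSpell requires.

```python
from collections import Counter


def lowercase_dictionary(lines):
    """Lower-case the term of each 'term count' line and sum colliding counts."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        term, raw_count = parts
        try:
            count = int(raw_count)
        except ValueError:
            continue  # skip lines whose frequency is not an integer
        counts[term.lower()] += count
    return counts


def convert_file(src_path, dst_path):
    """Read a dictionary file and write a lower-cased copy, most frequent first."""
    with open(src_path, encoding="utf-8") as src:
        counts = lowercase_dictionary(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for term, count in counts.most_common():
            dst.write(f"{term} {count}\n")


# Small in-memory demonstration with made-up frequencies.
sample = ["Christina 120", "christina 5", "Maria 300", "", "not a valid line"]
print(lowercase_dictionary(sample))
```

For a real file, something like `convert_file("names.txt", "names_lower.txt")` would produce the lower-cased dictionary to pass to the `load` command; the input terms need to be lower-cased as well.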
thanks!
With MaxEditDistance = 0 only the original word is returned in the output. With MaxEditDistance = 1 the results are segmented, but they then have an edit distance of 1.