wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars, 284 forks

Issue parsing hyphenated words or those containing an underscore. #85

Closed by krasi0 4 years ago

krasi0 commented 4 years ago

Shouldn't the regex at https://github.com/wolfgarbe/SymSpell/blob/13bdc134573a14cf05bc06cb7817a7ce7b9a9af4/SymSpell/SymSpell.cs#L727 be changed from @"['’\w-[_]]+" to @"['’\w-]+" so that compound words that have been added to the dictionary, such as decision-making or in_vitro, stay together? Are there any drawbacks to that change?
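To make the difference between the two patterns concrete, here is a minimal sketch in Python rather than C#. It assumes the current pattern subtracts the underscore from `\w`. Python's `re` module has no .NET-style character class subtraction, so that pattern is approximated with `[^\W_]`. The function name `parse_words` and the sample sentence are illustrative, not taken from SymSpell.

```python
import re

# Approximation of the current .NET pattern: apostrophes plus word
# characters, with the underscore excluded ([^\W_] means "\w minus _").
SPLIT_COMPOUNDS = re.compile(r"(?:['’]|[^\W_])+")

# Approximation of the proposed pattern: apostrophes, word characters
# (underscore included), and a literal hyphen.
KEEP_COMPOUNDS = re.compile(r"['’\w-]+")


def parse_words(text, pattern):
    """Lower-case the text and return every token the pattern matches."""
    return pattern.findall(text.lower())


sample = "Decision-making in_vitro isn't easy"
current_tokens = parse_words(sample, SPLIT_COMPOUNDS)
proposed_tokens = parse_words(sample, KEEP_COMPOUNDS)

print(current_tokens)   # compounds split at '-' and '_'
print(proposed_tokens)  # compounds kept as single tokens
```

With the first pattern the hyphen and underscore act as separators; with the second they become part of the token, so each compound reaches the dictionary lookup as one term.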

wolfgarbe commented 4 years ago

You could do this, but then you have to make sure to use a dictionary that contains ALL valid compound words, both in first-second and in first_second form. Otherwise, any compound word that is not in the dictionary but that you still try to check or correct will not be recognized as a correct word, and you will get strange correction suggestions.

In the current code, the idea is to always split compounds, both in the dictionary and in the input terms, and to spell-correct the parts separately.
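The trade-off can be sketched with a toy lookup. This is only an illustration under stated assumptions: the dictionary is a hypothetical five-word set (real SymSpell dictionaries are frequency lists), the helper `unknown_terms` is made up for this example, and the two Python patterns only approximate the .NET ones discussed in this issue.

```python
import re

# Hypothetical toy dictionary that contains the parts but no compounds.
DICTIONARY = {"decision", "making", "policy", "in", "vitro"}

# Approximations of the two tokenizers: one splits at '-' and '_',
# the other keeps hyphenated and underscored compounds whole.
SPLIT_PATTERN = r"(?:['’]|[^\W_])+"
KEEP_PATTERN = r"['’\w-]+"


def unknown_terms(text, keep_compounds):
    """Return the tokens that a plain dictionary lookup does not find."""
    pattern = KEEP_PATTERN if keep_compounds else SPLIT_PATTERN
    tokens = re.findall(pattern, text.lower())
    return [t for t in tokens if t not in DICTIONARY]


text = "policy-making in_vitro"
split_unknown = unknown_terms(text, keep_compounds=False)
whole_unknown = unknown_terms(text, keep_compounds=True)

print(split_unknown)  # parts are all known, so nothing is flagged
print(whole_unknown)  # whole compounds are missing from the dictionary
```

When compounds are split, every part is found and nothing is sent to correction. When they are kept whole, both compounds are unknown and would be handed to the corrector, which is where the strange suggestions come from unless the dictionary lists every compound in both forms.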