wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

[Question] Few queries regarding upcoming changes, benchmarking spell check and dictionary formation etc. #87

Open sai-prasanna opened 4 years ago

sai-prasanna commented 4 years ago

Thanks for maintaining this excellent library. I am currently contributing to the Rust clone of this. I have a few queries.

  1. Regarding the following upcoming change: can you elaborate on how the pigeonhole principle is used, and what algorithm is used for it?

Utilizing the pigeonhole principle by partitioning both query and dictionary terms will result in 5x less memory consumption and 3x faster precalculation time.

Reducing memory would really help for WebAssembly usage in the browser.

  2. To get the maximum number of matches against the current frequency dictionary, what tokenizer (or just a regex split) would you suggest using?

  3. For languages not in SCOWL, what would be an optimal way to form dictionaries?

  4. In JamSpell, the author uses a 3-gram language model for re-ranking. Is that on the roadmap, and do you think it would bring enough improvement to justify its use?

  5. How should one benchmark spell correction? Is a synthetic benchmark on a similarly tokenized dataset with random errors introduced a good way to go about it (when human-error data is not available)?

  6. What metrics are relevant for the real world (accuracy, false positives, precision, etc.) when benchmarking on such synthetic datasets?

wolfgarbe commented 4 years ago

Regarding the following upcoming change: can you elaborate on how the pigeonhole principle is used, and what algorithm is used for it?

Current approach: Prefix

Currently, we take a prefix of fixed length (usually 5..7) from the term (both the dictionary term and the input term). From that prefix we recursively create all possible 1...n-deletes (edits where you delete a single char at the first position, delete a single char at the second position, ..., delete a single char at the last position). We then recursively do the same with the resulting deletes, until n (= maximum edit distance) chars have been deleted from the prefix. Finally, we try to match the deletes of the input term against the precalculated deletes in the dictionary.
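As a minimal illustration of that precalculation step (a sketch only, not the actual SymSpell or port code; prefix length and maximum distance are arbitrary choices here), the recursive delete generation could look like this in Rust:

```rust
use std::collections::HashSet;

/// Generate every string reachable from `term` by deleting up to
/// `max_distance` characters (a sketch of the delete precalculation).
fn generate_deletes(term: &str, max_distance: usize) -> HashSet<String> {
    let mut deletes = HashSet::new();
    add_deletes(term, max_distance, &mut deletes);
    deletes
}

fn add_deletes(term: &str, remaining: usize, deletes: &mut HashSet<String>) {
    if remaining == 0 {
        return;
    }
    let chars: Vec<char> = term.chars().collect();
    if chars.len() <= 1 {
        return;
    }
    for i in 0..chars.len() {
        // Delete the single character at position i.
        let mut shorter = String::with_capacity(term.len());
        shorter.extend(chars[..i].iter());
        shorter.extend(chars[i + 1..].iter());
        // Recurse only on deletes we have not seen before.
        if deletes.insert(shorter.clone()) {
            add_deletes(&shorter, remaining - 1, deletes);
        }
    }
}

fn main() {
    // Both dictionary terms and the input term are reduced to a prefix
    // (here length 7) and indexed/queried via the same delete set.
    let prefix: String = "pigeonhole".chars().take(7).collect();
    let deletes = generate_deletes(&prefix, 2); // max edit distance 2
    println!("{} deletes for prefix '{}'", deletes.len(), prefix);
}
```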

Future approach: Pigeonhole partitioning

We use the pigeonhole principle: a term (both the dictionary term and the input term) is partitioned into enough (e.g. equal) parts that at least one part has an edit distance of no more than 1. Example: if a term has a maximum edit distance of 3 and we split it into two parts, then one of the two parts has an edit distance of at most 1. Knowing this, we create only 1-deletes. As there are far fewer 1-deletes than 3-deletes, this takes much less space and is much faster (there are fewer deletes to check, AND calculating the Levenshtein edit distance with a maximum edit distance of 1 is much faster than with a maximum edit distance of 3). The actual implementation gets a bit more complicated, but I have a working prototype that confirmed the principle.
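For illustration only, here is a rough Rust sketch of that partitioning idea as described above (not the working prototype): split a term into two halves and store only the 1-deletes of each half. It deliberately ignores boundary alignment and transpositions across the split, which are discussed later in this thread.

```rust
/// Split a term into two halves (illustrative split at the middle).
fn split_halves(term: &str) -> (String, String) {
    let chars: Vec<char> = term.chars().collect();
    let mid = chars.len() / 2;
    (chars[..mid].iter().collect(), chars[mid..].iter().collect())
}

/// The part itself plus all its 1-deletes: the only keys stored per half.
fn one_deletes(part: &str) -> Vec<String> {
    let chars: Vec<char> = part.chars().collect();
    let mut keys = vec![part.to_string()];
    for i in 0..chars.len() {
        let mut shorter = String::new();
        shorter.extend(chars[..i].iter());
        shorter.extend(chars[i + 1..].iter());
        keys.push(shorter);
    }
    keys
}

fn main() {
    let (left, right) = split_halves("pigeonhole");
    // Far fewer keys per term than the full 1..3-delete set of the prefix
    // approach; candidates found this way would still be verified with a
    // full Damerau-Levenshtein check.
    println!("left keys:  {:?}", one_deletes(&left));
    println!("right keys: {:?}", one_deletes(&right));
}
```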

For languages not in SCOWL, what would be an optimal way to form dictionaries?

Frequency dictionaries in many other languages can be found in the FrequencyWords repository (https://github.com/hermitdave/FrequencyWords).

You can also use Hunspell dictionaries and combine them with Google Books Ngram data which provides representative word frequencies.
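As a hedged sketch of that combination step, the following Rust snippet intersects a word list with aggregated Ngram counts. The file names, the assumption of an already-expanded plain word list, and the tab-separated "word<TAB>count" format are all illustrative; the output uses the "term count" per-line format of the shipped frequency dictionaries.

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical input: aggregated Google Books Ngram counts, one
    // "word<TAB>count" pair per line.
    let counts: HashMap<String, u64> = fs::read_to_string("ngram_counts.tsv")?
        .lines()
        .filter_map(|line| {
            let mut it = line.split('\t');
            let word = it.next()?.to_lowercase();
            let count = it.next()?.parse().ok()?;
            Some((word, count))
        })
        .collect();

    // Hypothetical input: a Hunspell-derived word list, one word per line,
    // already expanded ("unmunched") to plain word forms.
    let mut entries: Vec<(String, u64)> = fs::read_to_string("hunspell_words.txt")?
        .lines()
        .map(|w| w.trim().to_lowercase())
        .filter_map(|w| counts.get(&w).map(|&c| (w, c)))
        .collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1));

    // Write in the "term count" per-line format used by the frequency dictionaries.
    let out: String = entries
        .iter()
        .map(|(w, c)| format!("{} {}\n", w, c))
        .collect();
    fs::write("frequency_dictionary.txt", out)
}
```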

In JamSpell, the author uses a 3-gram language model for re-ranking. Is that on the roadmap, and do you think it would bring enough improvement to justify its use?

See my comment on the Jamspell repository: https://github.com/bakwc/JamSpell/issues/15#issuecomment-379892884

How should one benchmark spell correction? Is a synthetic benchmark on a similarly tokenized dataset with random errors introduced a good way to go about it (when human-error data is not available)?

A synthetic benchmark is good for performance comparison. For measuring correction quality, real-world data is much more adequate. But the correction quality of SymSpell is currently limited anyway: the idea of SymSpell was to showcase a very fast algorithm, not to deliver a production-ready, turn-key solution.
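For a synthetic benchmark, the error-injection step could look roughly like the sketch below (it assumes the widely used `rand` crate with its 0.8-style API; the choice of a single uniformly random character edit is an assumption, not a prescription). One would then count how often the corrector recovers the original word.

```rust
use rand::Rng; // assumes the `rand` crate, 0.8-style API

/// Apply one random character-level edit (delete, insert, substitute or
/// adjacent transposition) to a correctly spelled word.
fn corrupt(word: &str, rng: &mut impl Rng) -> String {
    let mut chars: Vec<char> = word.chars().collect();
    if chars.len() < 2 {
        return word.to_string();
    }
    let alphabet: Vec<char> = ('a'..='z').collect();
    let pos = rng.gen_range(0..chars.len());
    match rng.gen_range(0..4) {
        0 => { chars.remove(pos); }                             // deletion
        1 => chars.insert(pos, alphabet[rng.gen_range(0..26)]), // insertion
        2 => chars[pos] = alphabet[rng.gen_range(0..26)],       // substitution
        _ => {                                                  // adjacent transposition
            let p = pos.min(chars.len() - 2);
            chars.swap(p, p + 1);
        }
    }
    chars.into_iter().collect()
}

fn main() {
    let mut rng = rand::thread_rng();
    for w in ["spelling", "correction", "benchmark"] {
        println!("{} -> {}", w, corrupt(w, &mut rng));
    }
}
```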

One could additionally implement a weighted edit distance, giving a higher priority to pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). There are two Java SymSpell ports that implement a weighted edit distance: https://github.com/MighTguY/customized-symspell and https://github.com/searchhub/preDict
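As a hedged sketch of the keyboard-weighting idea (not taken from either linked port), a weighted edit distance could simply reduce the substitution cost for adjacent keys; the tiny row-only adjacency set and the 0.5 cost below are illustrative choices.

```rust
/// True if two keys are horizontal neighbours on a QWERTY row
/// (illustrative; a real implementation would cover the full layout
/// and possibly phonetic similarity as well).
fn keyboard_adjacent(a: char, b: char) -> bool {
    const ROWS: [&str; 3] = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];
    for row in ROWS {
        let keys: Vec<char> = row.chars().collect();
        for pair in keys.windows(2) {
            if (pair[0] == a && pair[1] == b) || (pair[0] == b && pair[1] == a) {
                return true;
            }
        }
    }
    false
}

/// Levenshtein-style DP where keyboard-adjacent substitutions cost 0.5.
fn weighted_edit_distance(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut dp = vec![vec![0.0f64; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() { dp[i][0] = i as f64; }
    for j in 0..=b.len() { dp[0][j] = j as f64; }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] {
                0.0
            } else if keyboard_adjacent(a[i - 1], b[j - 1]) {
                0.5 // likely fat-finger substitution: cheaper
            } else {
                1.0
            };
            dp[i][j] = (dp[i - 1][j] + 1.0)        // deletion
                .min(dp[i][j - 1] + 1.0)           // insertion
                .min(dp[i - 1][j - 1] + sub);      // (weighted) substitution
        }
    }
    dp[a.len()][b.len()]
}

fn main() {
    println!("{}", weighted_edit_distance("test", "tesy")); // 0.5: t and y are neighbours
    println!("{}", weighted_edit_distance("test", "tesk")); // 1.0: t and k are not
}
```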

Using more context information with word embeddings and word vectors is another promising approach: https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26 https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79

What metrics are relevant for the real world (accuracy, false positives, precision, etc.) when benchmarking on such synthetic datasets?

answers to be continued ....

sai-prasanna commented 4 years ago

@wolfgarbe This is awesome. Thanks for taking the time to answer this in a thorough fashion.

pz-eth commented 3 years ago

Sorry to respond to such an old question, but is there any update on the pigeonhole partitioning implementation? Am I correct in thinking your example only works for Levenshtein distance (i.e. no transpositions)?

wolfgarbe commented 3 years ago

No update on the pigeonhole partitioning implementation yet. SymSpell uses the Damerau-Levenshtein distance and therefore also supports the transposition of two adjacent characters.

pz-eth commented 3 years ago

Thanks for getting back to me (and for the whole SymSpell implementation).

Sorry, I meant your example of the pigeonhole partitioning above. For example, if we have the word 'pigeonhole' and split it into 2 parts, 'pigeo' and 'nhole': if there are 3 edits that are each an insert, delete or substitute, then we can say that one partition will definitely have an edit distance of <=1. If, however, the characters on the boundary of the partitions are transposed, there could be the case that both partitions have an edit distance of 2. I believe that with a max edit distance of K, the word would have to be split into K+1 partitions to ensure that at least one partition has a Damerau-Levenshtein distance of <=1.
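For concreteness, a small check of this boundary case could look like the following (the corrupted string is illustrative, and the `strsim` crate is assumed for the distance functions): one adjacent transposition straddling the split plus one substitution in each half gives 3 Damerau-Levenshtein edits on the whole word, yet both halves end up at Levenshtein distance 2.

```rust
fn main() {
    let original = "pigeonhole";  // split as "pigeo" | "nhole"
    let corrupted = "pugenohoke"; // i->u, o<->n at the boundary, l->k

    assert_eq!(strsim::damerau_levenshtein(original, corrupted), 3);
    assert_eq!(strsim::levenshtein("pigeo", "pugen"), 2); // left half
    assert_eq!(strsim::levenshtein("nhole", "ohoke"), 2); // right half
    println!("neither half is within edit distance 1");
}
```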

wolfgarbe commented 3 years ago

The actual implementation gets a bit more complicated

  1. When partitioning, you have to make sure that the resulting parts are still sufficiently long so that they retain discriminatory power.
  2. As words have a limited average length (English, French, Spanish and German average 5.10, 5.13, 5.22 and 6.26 characters, respectively), for most words you can only split them into two parts; only very long words (which are also the expensive ones in terms of space consumption and edit distance calculation time) can be split into more than two parts.
  3. If you have an edit distance higher than 3, e.g. 5, the pigeonhole principle still applies when splitting into two parts: when splitting a word with edit distance 5 into two parts, one part has a maximum edit distance of two. This is again much better than five.

Adjacent transpositions

There are different ways to deal with these:

  1. Use the Levenshtein distance, where an adjacent transposition is counted as 2 separate edits.
  2. With Damerau-Levenshtein, partitioning into 2 parts with adjacent transpositions still saves time and space:

     edit distance (word)   maximum edit distance (part)
     1                      1
     2                      2
     3                      2
     4                      3
     5                      3
     6                      4

  3. Instead of increasing the number of partitions as you proposed, or increasing the maximum edit distance considered per part, it is better to just vary the split position p: you try once with split position p and a second time with split position p+1. This increases time and space only by a factor of two instead of growing quadratically (see the sketch below).
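A rough sketch of that two-split-position idea (my reading of point 3 above, not the prototype): index and query each term under split positions p and p+1, so an insertion or deletion that shifts the boundary by one character is still caught by one of the two splits; the choice of the middle as the base split position is illustrative.

```rust
/// Produce the (left, right) parts for split positions p and p+1.
fn splits(term: &str) -> Vec<(String, String)> {
    let chars: Vec<char> = term.chars().collect();
    let p = chars.len() / 2; // illustrative base split position
    [p, p + 1]
        .into_iter()
        .filter(|&pos| pos < chars.len())
        .map(|pos| {
            (
                chars[..pos].iter().collect::<String>(),
                chars[pos..].iter().collect::<String>(),
            )
        })
        .collect()
}

fn main() {
    for (left, right) in splits("pigeonhole") {
        println!("{} | {}", left, right); // "pigeo | nhole" and "pigeon | hole"
    }
}
```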
jamestwebber commented 1 year ago

I'm very interested in the partitioning improvement, as I am trying to use the SymSpell algorithm to correct errors in quite long "words" (39 characters), and the performance and memory consumption really start to suffer there. These words are nearly random sequences (DNA), so the prefix assumption is not as useful.

I only need to worry about Levenshtein distance, not Damerau-Levenshtein, so a partial solution would still be quite useful.

I can sort of see how this would work, but I'm curious about the complications you mentioned. I could try to implement it myself with a few pointers (likely as a fork of the Rust implementation).