Regarding the upcoming change: can you elaborate on how the pigeonhole principle is used, and what algorithm is used for it?
**Current approach: Prefix**

We take a prefix of fixed length (usually 5..7) from the term (both dictionary and input term). From that prefix we recursively create all possible 1...n-deletes (edits where you delete a single char at the first position, delete a single char at the second position, ..., delete a single char at the last position). We then do the same recursively with the resulting deletes, until n (= maximum edit distance) chars have been deleted from the prefix. Finally, we try to match the deletes from the input term with the precalculated deletes in the dictionary.
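As a minimal sketch of the delete generation described above (not the actual SymSpell source; function and parameter names are my own, Python used for illustration):

```python
def generate_deletes(term, max_edit_distance=2, prefix_length=7):
    """Recursively generate all deletes of the term's prefix,
    up to max_edit_distance deleted characters."""
    prefix = term[:prefix_length]
    deletes = set()
    current = {prefix}
    for _ in range(max_edit_distance):
        next_round = set()
        for word in current:
            for i in range(len(word)):
                candidate = word[:i] + word[i + 1:]
                if candidate not in deletes:
                    deletes.add(candidate)
                    next_round.add(candidate)
        current = next_round
    return deletes

# Both dictionary terms and input terms are expanded the same way;
# lookup then boils down to matching the two delete sets.
print(sorted(generate_deletes("hello", max_edit_distance=1)))
# ['ello', 'hell', 'helo', 'hllo']
```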
**Future approach: Pigeonhole partitioning**

We use the pigeonhole principle: a term (both dictionary and input term) is partitioned into enough (e.g. equal) parts that at least one part has a maximum edit distance of no more than 1. Example: if a term has a maximum edit distance of 3 and we split it into two parts, then one of the two parts has a maximum edit distance of only 1. Knowing this, we create only 1-deletes. As there are far fewer 1-deletes than 3-deletes, this takes much less space and is much faster (there are fewer deletes to check, AND calculating the Levenshtein edit distance with a maximum edit distance of 1 is much faster than with a maximum edit distance of 3). The actual implementation gets a bit more complicated, but I have a working prototype that confirmed the principle.
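A rough sketch of the partitioning idea as I read the description above (this is not the prototype; plain Levenshtein edits are assumed, and the transposition caveat is discussed further down in the thread):

```python
def split_into_parts(term, num_parts=2):
    """Split a term into num_parts contiguous, roughly equal parts."""
    size = -(-len(term) // num_parts)  # ceiling division
    return [term[i:i + size] for i in range(0, len(term), size)]

def guaranteed_part_distance(word_edit_distance, num_parts=2):
    """Pigeonhole bound: if the whole word is within word_edit_distance,
    at least one part is within this much smaller distance."""
    return word_edit_distance // num_parts

print(split_into_parts("pigeonhole"))   # ['pigeo', 'nhole']
print(guaranteed_part_distance(3))      # 1 -> only 1-deletes needed per part
```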
For languages not in SCOWL, what would be an optimal way to form dictionaries?
Frequency dictionaries in many other languages can be found at the FrequencyWords repository
You can also use Hunspell dictionaries and combine them with Google Books Ngram data which provides representative word frequencies.
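Here is a rough sketch (my own illustration, not an official tool) of combining the two: keep only words found in a Hunspell word list and take their counts from a unigram count file, assuming the Google Books Ngram per-year data has already been aggregated to one `word<TAB>count` line per word; all file paths are hypothetical.

```python
def build_frequency_dictionary(hunspell_dic_path, unigram_counts_path, out_path):
    """Write a SymSpell-style 'word count' frequency dictionary."""
    with open(hunspell_dic_path, encoding="utf-8") as f:
        # Hunspell .dic lines look like "word/FLAGS"; drop the affix flags.
        valid = {line.split("/")[0].strip().lower() for line in f if line.strip()}
    counts = {}
    with open(unigram_counts_path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")[:2]
            word = word.lower()
            if word in valid:
                counts[word] = counts.get(word, 0) + int(count)
    with open(out_path, "w", encoding="utf-8") as f:
        for word, count in sorted(counts.items(), key=lambda kv: -kv[1]):
            f.write(f"{word} {count}\n")
```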
In JamSpell, the author uses a 3-gram language model for re-ranking. Is that on the roadmap, and do you think it would bring enough improvement to justify its use?
See my comment on the JamSpell repository: https://github.com/bakwc/JamSpell/issues/15#issuecomment-379892884
How should one benchmark spell correction? Is a synthetic benchmark on a similarly tokenized dataset with random errors introduced a good way to go about it (when human error data is not available)?
A synthetic benchmark is good for performance comparison. For measuring correction quality, real-world data is much more appropriate. But the correction quality of SymSpell is currently limited anyway: the idea of SymSpell was to showcase a very fast algorithm, not to deliver a production-ready, turn-key solution.
One could additionally implement a weighted edit distance that gives a higher priority to character pairs which are close to each other on the keyboard layout or which sound similar (e.g. via Soundex or other phonetic algorithms that identify different spellings of the same sound). There are two Java SymSpell ports that implement a weighted edit distance: https://github.com/MighTguY/customized-symspell and https://github.com/searchhub/preDict
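As an illustration of the keyboard-weighting idea (a sketch of the general technique, not what those Java ports actually implement; the neighbour table and the 0.5 weight are made up):

```python
# Weighted edit distance where substitutions between keys that are adjacent
# on a QWERTY keyboard cost less than arbitrary substitutions.
QWERTY_NEIGHBOURS = {
    'a': 'qwsz', 's': 'awedxz', 'd': 'serfcx', 'e': 'wsdr', 'r': 'edft',
    # ... remaining keys omitted for brevity
}

def substitution_cost(a, b):
    if a == b:
        return 0.0
    if b in QWERTY_NEIGHBOURS.get(a, ''):
        return 0.5   # likely fat-finger error
    return 1.0

def weighted_levenshtein(s, t):
    """Standard dynamic-programming Levenshtein with a weighted substitution cost."""
    m, n = len(s), len(t)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)
    for j in range(1, n + 1):
        dist[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,                                        # deletion
                dist[i][j - 1] + 1.0,                                        # insertion
                dist[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return dist[m][n]

print(weighted_levenshtein("test", "trst"))  # 0.5: 'e' and 'r' are adjacent keys
```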
Using more context information with word embeddings and word vectors is another promising approach: https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26 https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79
Which metrics are relevant for the real world (accuracy, false positives, precision, etc.) when benchmarking on such synthetic datasets?
answers to be continued ....
@wolfgarbe This is awesome. Thanks for taking the time to answer this in a thorough fashion.
Sorry to respond to such an old question, but is there any update on the pigeonhole partitioning implementation? Am I correct in thinking your example only works for Levenshtein distance (i.e. no transpositions)?
No update on the pigeonhole partitioning implementation yet. SymSpell uses the Damerau-Levenshtein distance and therefore also supports the transposition of two adjacent characters.
Thanks for getting back to me (and for the whole SymSpell implementation).
Sorry, I meant your example of the pigeonhole partitioning above. For example, if we have the word 'pigeonhole' and split it into 2 parts, 'pigeo' and 'nhole': if there are 3 edits that are each an insert, delete or substitute, then we can say that one partition will definitely have an edit distance of <=1. If, however, the characters on the boundary of the partitions are transposed, there could be the case that both partitions have an edit distance of 2. I believe that with a max edit distance of K, the word would have to be split into K+1 partitions to ensure that at least one partition has a Damerau-Levenshtein distance <=1.
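To make that boundary case concrete, a quick check with a plain Levenshtein implementation (split position 5 as in the example above):

```python
def levenshtein(s, t):
    """Plain Levenshtein distance (insert/delete/substitute) via a rolling DP row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]

word = "pigeonhole"
typo = "pigenohole"   # one adjacent transposition of 'o' and 'n'
                      # => Damerau-Levenshtein distance 1 for the whole word

print(levenshtein(word[:5], typo[:5]), levenshtein(word[5:], typo[5:]))  # 1 1
# The single whole-word edit costs 1 in *both* halves, so with 3 edits in total
# (this transposition plus one more edit per half) both halves can reach 2,
# which breaks the "one half has distance <= 1" guarantee.
```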
The actual implementation gets a bit more complicated:
- When partitioning you have to make sure that the resulting parts are still sufficiently long, so that they retain discriminatory power.
- As words have a limited average length (English, French, Spanish and German average 5.10, 5.13, 5.22 and 6.26 characters, respectively), most words can only be split into two parts; only very long words (which are also the expensive ones in terms of space consumption and edit distance calculation time) can be split into more than two parts.
- If you have an edit distance higher than 3, e.g. 5, the pigeonhole principle still applies when splitting into two parts. When splitting a word with edit distance 5 into two parts, one part has a maximum edit distance of two, which is again much better than five.
**Adjacent transposition**

There are different ways to deal with this:
| edit distance (word) | maximum edit distance (part) |
|----------------------|------------------------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 4 |
I'm very interested in the partitioning improvement, as I am trying to use the SymSpell algorithm to correct errors in quite long "words" (39 characters), and the performance and memory consumption really start to suffer there. These words are nearly random sequences (DNA), so the prefix assumption is not as useful.
I only need to worry about Levenshtein distance, not Damerau-Levenshtein, so a partial solution would still be quite useful.
I can sort of see how this would work but I'm curious about the complications you mentioned. I could try to implement it myself with a few pointers (likely as a fork of the Rust implementation)
Thanks for maintaining this excellent library. I am currently contributing to the Rust clone of this. I have a few queries.
Reducing memory usage would really help for WebAssembly usage in the browser.
To get maximum matches against the current frequency dictionary, what tokenizer (or just a regex split) would you suggest using?