wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

Search Perceptual Hash #79

Closed: Phalelashvili closed this issue 4 years ago

Phalelashvili commented 4 years ago

Is SymSpell capable of searching through fuzzy/perceptual hashes? For example, could it match this hash:

0x88006FFF2D9B9D739B7F8A8C7F8E828989858D89858A868888878A88888887898888898887888788

to this one:

0x88006FFF2C9C9D739B7F8A8C7F8E828989858D89868A868888878A88888987898888898887888788

wolfgarbe commented 4 years ago

Yes.

```csharp
//set parameters
const int initialCapacity = 82765;
const int maxEditDistance = 4;
//prefix length (up to the length of the longest entry): terms should be sufficiently
//distinguishable within the prefix; occasional terms with identical prefixes are allowed
const int prefixLength = 7;
var symSpell = new SymSpell(initialCapacity, maxEditDistance, prefixLength);
symSpell.CreateDictionaryEntry("0x88006FFF2D9B9D739B7F8A8C7F8E828989858D89858A868888878A88888887898888898887888788", 1);

List<SymSpell.SuggestItem> suggestions = null;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();

//check whether the input term or similar terms within edit distance are in the dictionary;
//results are sorted by ascending edit distance, then by descending word frequency
const SymSpell.Verbosity verbosity = SymSpell.Verbosity.Closest;
suggestions = symSpell.Lookup("0x88006FFF2C9C9D739B7F8A8C7F8E828989858D89868A868888878A88888987898888898887888788", verbosity);

stopWatch.Stop();
Console.WriteLine(stopWatch.Elapsed.TotalMilliseconds.ToString() + " ms");

//display term, edit distance and frequency
foreach (var suggestion in suggestions)
{
    Console.WriteLine(suggestion.term + " " + suggestion.distance.ToString() + " " + suggestion.count.ToString("N0"));
}
```

Returns:

```
0x88006FFF2D9B9D739B7F8A8C7F8E828989858D89858A868888878A88888887898888898887888788 4 1
```

in 6,3286 ms.

Alternatively you could also use a technique called SimHash: https://moz.com/devblog/near-duplicate-detection
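
For context, here is a minimal, hypothetical C# sketch of the SimHash idea described in the linked article (it is not part of SymSpell; the shingling and per-feature hashing choices below are illustrative assumptions): each item is reduced to a 64-bit fingerprint, and near-duplicates are items whose fingerprints differ in only a few bits.

```csharp
// Hypothetical sketch of SimHash near-duplicate detection (not part of SymSpell).
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class SimHashSketch
{
    // Combine the per-feature hashes into one 64-bit fingerprint.
    static ulong Fingerprint(string[] features)
    {
        var weights = new int[64];
        using (var md5 = MD5.Create())
        {
            foreach (var feature in features)
            {
                // Hash each feature to 64 bits (first 8 bytes of MD5, purely for illustration).
                ulong h = BitConverter.ToUInt64(md5.ComputeHash(Encoding.UTF8.GetBytes(feature)), 0);
                for (int bit = 0; bit < 64; bit++)
                    weights[bit] += ((h >> bit) & 1UL) != 0 ? 1 : -1;
            }
        }
        ulong fingerprint = 0;
        for (int bit = 0; bit < 64; bit++)
            if (weights[bit] > 0) fingerprint |= 1UL << bit;
        return fingerprint;
    }

    // Near-duplicates have a small Hamming distance between their fingerprints.
    static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0) { x &= x - 1; count++; }
        return count;
    }

    static void Main()
    {
        // Use overlapping 4-character shingles of the hash strings as features.
        string[] Shingles(string s) => Enumerable.Range(0, s.Length - 3).Select(i => s.Substring(i, 4)).ToArray();

        string a = "0x88006FFF2D9B9D739B7F8A8C7F8E828989858D89858A868888878A88888887898888898887888788";
        string b = "0x88006FFF2C9C9D739B7F8A8C7F8E828989858D89868A868888878A88888987898888898887888788";
        Console.WriteLine(HammingDistance(Fingerprint(Shingles(a)), Fingerprint(Shingles(b))));
    }
}
```

At scale, fingerprints are typically indexed so that candidates within a small Hamming distance can be found without comparing every pair.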

Phalelashvili commented 4 years ago

Is there a more efficient way to add millions of entries from a file, or do I just have to iterate through the lines? It's taking very long.

Edit: even staging is slow, or I couldn't figure it out correctly.

wolfgarbe commented 4 years ago

You have to iterate through the lines and add each entry to the dictionary. Creating the dictionary takes long; the search is fast. That's the trade-off.
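
For what it's worth, a minimal sketch of a staged bulk load, assuming a hypothetical tab-separated file of term/count pairs and the SuggestionStage/CommitStaged API of the C# SymSpell 6.x port (verify the exact signatures against the version you use):

```csharp
// Hypothetical bulk load of "term<TAB>count" lines from a file, using the staged
// CreateDictionaryEntry/CommitStaged path of the C# SymSpell port.
// File name, capacity and parameters below are illustrative only.
using System;
using System.IO;

class BulkLoad
{
    static void Main()
    {
        const int initialCapacity = 1000000;   // roughly the number of entries you expect
        const int maxEditDistance = 4;
        const int prefixLength = 7;
        var symSpell = new SymSpell(initialCapacity, maxEditDistance, prefixLength);

        // Staging collects the generated deletes in an intermediate structure and
        // commits them in one pass, which avoids repeated rehashing/resizing.
        var staging = new SymSpell.SuggestionStage(initialCapacity);

        foreach (string line in File.ReadLines("hashes.tsv"))   // hypothetical input file
        {
            string[] parts = line.Split('\t');
            if (parts.Length < 2 || !long.TryParse(parts[1], out long count)) continue;
            symSpell.CreateDictionaryEntry(parts[0], count, staging);
        }
        symSpell.CommitStaged(staging);

        Console.WriteLine("dictionary created");
    }
}
```

Even with staging, the delete precalculation for each entry still runs once; that, rather than file I/O, is usually what dominates the load time.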

Phalelashvili commented 4 years ago

Can I dump the dictionary to a file so I can load it later without recreating it? I found functions that load from a file, but couldn't find a dump.

wolfgarbe commented 4 years ago

This is currently not implemented. You would have to code this yourself.

The idea was that you start your server only once a week or month, so recreating the dictionary is not a problem.
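
As one possible starting point for rolling this yourself, a minimal sketch that only persists the raw term/count pairs to a TSV file and reloads them with LoadDictionary(path, termIndex, countIndex). The helper names and file layout are made up, and note that the precomputed delete index is not serialized this way: loading still rebuilds it, so only the original parsing/counting work is reused.

```csharp
// Hypothetical helper: persist (term, count) pairs collected while building the
// dictionary, and reload them later via SymSpell.LoadDictionary. The internal
// delete index is NOT serialized here; it is rebuilt when the file is loaded.
using System;
using System.Collections.Generic;
using System.IO;

static class DictionaryDump
{
    // Write "term<TAB>count" lines, the layout LoadDictionary expects with termIndex 0, countIndex 1.
    public static void Dump(IEnumerable<KeyValuePair<string, long>> entries, string path)
    {
        using (var writer = new StreamWriter(path))
            foreach (var entry in entries)
                writer.WriteLine(entry.Key + "\t" + entry.Value);
    }

    public static bool Reload(SymSpell symSpell, string path)
    {
        // termIndex 0, countIndex 1 match the layout written by Dump above.
        return symSpell.LoadDictionary(path, 0, 1);
    }
}
```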

Phalelashvili commented 4 years ago

I have 100 million of these hashes. Is it even practical to use SymSpell with a maxEditDistance of 10?

wolfgarbe commented 4 years ago

1 million entries with edit distance 4 could be OK; 100 million with edit distance 10 probably not on a single server. You could distribute/partition them over multiple servers.

Or try SimHash: https://moz.com/devblog/near-duplicate-detection

Phalelashvili commented 4 years ago

I'll take a look at SimHash. Thanks for the help.