xdrop / fuzzywuzzy

Java fuzzy string matching implementation of the well-known Python fuzzywuzzy algorithm. Fuzzy search for Java.
GNU General Public License v2.0

Results differ from python library #74

Closed. daniel17903 closed this issue 5 years ago

daniel17903 commented 5 years ago

Hi, while porting some Python code to Java, I discovered that the Token Sort and Token Set ratios calculated by this library often do not match the ones calculated by the Python fuzzywuzzy library.

Here is an example.

Python Code:

from fuzzywuzzy import fuzz
print(fuzz.token_sort_ratio("efwe fwef", "wef wefwef"))
print(fuzz.token_set_ratio("efwe fwef", "wef wefwef"))

Output:

53
53

Java Code:

import me.xdrop.fuzzywuzzy.FuzzySearch;

public class Main {
    public static void main(String[] args) {
        System.out.println(FuzzySearch.tokenSortRatio("efwe fwef","wef wefwef"));
        System.out.println(FuzzySearch.tokenSetRatio("efwe fwef","wef wefwef"));
    }
}

Output:

84
84

Where is this difference coming from? Shouldn't these two outputs be equal?

xdrop commented 5 years ago

We only ported the python-levenshtein module (for speed), not Python's built-in difflib. Are you using the Python library with the python-levenshtein module installed?

i.e., instead of

pip install fuzzywuzzy

use

pip install fuzzywuzzy[speedup]

daniel17903 commented 5 years ago

Thanks. After installing fuzzywuzzy[speedup], the results match. I wasn't aware that the choice of underlying library affects the output.
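
For readers who want to see where the two numbers come from, here is a minimal sketch that reproduces both scores for the example strings using only the Python standard library. It rests on two assumptions: that the difflib-based path scores with SequenceMatcher's matching-block ratio, and that python-levenshtein's ratio counts a substitution as a delete plus an insert, which makes it equal to 2 * LCS / (len(a) + len(b)). The helper names (sort_tokens, difflib_score, lcs_length, indel_score) are hypothetical and are not part of either library.

```python
from difflib import SequenceMatcher


def sort_tokens(s: str) -> str:
    """Lowercase, split on whitespace, sort the tokens, and rejoin them."""
    return " ".join(sorted(s.lower().split()))


def difflib_score(a: str, b: str) -> int:
    """Score from difflib's matching-block ratio, 2 * M / (len(a) + len(b))."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_score(a: str, b: str) -> int:
    """Score assuming substitutions cost 2 (insert/delete-only distance)."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * 2 * lcs_length(a, b) / total)


if __name__ == "__main__":
    a = sort_tokens("efwe fwef")
    b = sort_tokens("wef wefwef")
    print(difflib_score(a, b))  # 53, the score reported without [speedup]
    print(indel_score(a, b))    # 84, the score reported by the Java port
```

difflib greedily takes the longest contiguous block first ("efwe"), which leaves only one more matchable character and gives 2 * 5 / 19. The subsequence-based ratio can match 8 of the characters, giving 2 * 8 / 19. Under the assumptions above, both backends behave as designed; they simply measure similarity differently.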