xdrop / fuzzywuzzy

Java fuzzy string matching implementation of the well known Python's fuzzywuzzy algorithm. Fuzzy search for Java
GNU General Public License v2.0
815 stars 117 forks source link

Difference between java and python implementation: Spoiler, the problem is the round #99

Open timotta opened 2 years ago

timotta commented 2 years ago

Here is the example where I was stuck. The python implemention gets 22, and the java implementation gets 23:

fuzz.token_set_ratio(
  "Vêndo ou troco por outro carro pode ser atrasado negócio volta ",
  "Titan 150 ano 2005 ", 
  False
)
FuzzySearch.tokenSetRatio(
  "Vêndo ou troco por outro carro pode ser atrasado negócio volta", 
  "Titan 150 ano 2005"
);

Debugging both code I could find that the problem is when rounding the value: 22.5

Python code, located in utils.py:

int(round(n))

Java code, located in SimpleRatio class is:

(int) Math.round(100 * DiffUtils.getRatio(s1, s2));

TLDR:

Java: Math.round(22.5) => 23 Python: round(22.5) => 22

Don't know which one is correct for this algorithm...

timotta commented 2 years ago

For now, I'm working with this local solution to have equivalence:

import me.xdrop.diffutils.DiffUtils;
import me.xdrop.fuzzywuzzy.ratios.SimpleRatio;

public class SimplePythonRatio extends SimpleRatio {
    @Override
    public int apply(String s1, String s2) {
        double ratio = 100 * DiffUtils.getRatio(s1, s2);
        double floored = Math.floor(ratio);
        double pythonRatio = ratio - floored <= 0.5 ? floored : Math.round(ratio);
        return (int) pythonRatio;
    }
}
new TokenSet().apply(s1, s2, new SimplePythonRatio());

What do you think of adding this to the java lib?

timotta commented 2 years ago

@leafah warned about HALF_EVEN python round behaviour ( https://en.wikipedia.org/wiki/Rounding#Round_half_to_even ), so the round(22.5) give us 22, and round(21.5) also returns 22. Java have this round implemented in BigDecimal:

import me.xdrop.diffutils.DiffUtils;
import me.xdrop.fuzzywuzzy.ratios.SimpleRatio;

import java.math.BigDecimal;
import java.math.RoundingMode;

public class SimplePythonRatio extends SimpleRatio {
    @Override
    public int apply(String s1, String s2) {
        double ratio = 100 * DiffUtils.getRatio(s1, s2);
        return (int) bankerRound(ratio);
    }

    // https://en.wikipedia.org/wiki/Rounding#Round_half_to_even
    double bankerRound(double value) {
        return new BigDecimal(value)
                .setScale(0, RoundingMode.HALF_EVEN)
                .doubleValue();
    }
}
burdoto commented 2 years ago

Honestly at this point I'm kinda confused regarding whats the deal here. If Python rounds 22.5 to 22 by default, then that's a Python issue (and honestly imo, yet another reason not to use Python at all). Resolving algebraic problems is really not the task of this library, thus such an implementation would be far out of scope (plus there is a workaround in Java's stdlib already).

I'm still interested in what @xdrop says about this, but I doubt it would be much different from this.