ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0
1.26k stars 155 forks source link

Seems to give wrong answer for matching_blocks() #16

Open ojomio opened 9 years ago

ojomio commented 9 years ago

If I use this in python list(SequenceMatcher(None, 'asfdsffdfd abcd', 'abcr').get_matching_blocks()) I get this result [(0, 0, 1), (12, 1, 2), (15, 4, 0)] which is clearly wrong, because the longest match is (11, 0, 3), not (12, 1, 2)

ztane commented 9 years ago

this means that the a was taken from the beginning, and bc in the middle; as far as I understand the documentation of the Levenshtein there is no guarantee or claim anywhere that a block needed to be the longest continuous; it is based on the editops calculations, that consider inserting'sfdsffdfd a' identical in cost to adding 'asfdsffdfd ' to the beginning; the editops then says that one of these is chosen arbitrarily.

BobLd commented 4 years ago

Problem could come from editops_from_cost_matrix: The if statement from line 5691 https://github.com/ztane/python-Levenshtein/blob/3a7412f38f1991c20d7a2765f30c2bb9cb1e63e0/Levenshtein/_levenshtein.c#L5691-L5699 should be moved to be the first if statement (moved to line 5675). Hope this helps

burdoto commented 4 years ago

Can confirm that @BobLd's suggestion fixes the problem.

maxbachmann commented 4 years ago

@BobLd this does not really fix the issue. It just shifts it, so it works for this explicit case. As an example

SequenceMatcher(None, "no", "bnonco").get_matching_blocks()

currently works, but breaks when moving the if statement to the top

maxbachmann commented 2 years ago

As a note: This is not a bug, since python-Levenshtein does not provide any guarantee, that matching_blocks includes the longest common substring. It bases the matching blocks on one of possibly multiple optimal alignments. Finding this alignment would require a search through all optimal alignments, which is simply not done in the library. Note that this is stated in the documentation of opcodes:

The result is a list of 5-tuples with the same meaning as in SequenceMatcher's get_opcodes() output. But since the algorithms differ, the actual sequences from Levenshtein and SequenceMatcher may differ too