Open ojomio opened 9 years ago
this means that the a
was taken from the beginning, and bc
in the middle; as far as I understand the documentation of the Levenshtein there is no guarantee or claim anywhere that a block needed to be the longest continuous; it is based on the editops calculations, that consider inserting'sfdsffdfd a'
identical in cost to adding 'asfdsffdfd '
to the beginning; the editops then says that one of these is chosen arbitrarily.
Problem could come from editops_from_cost_matrix
: The if statement from line 5691
https://github.com/ztane/python-Levenshtein/blob/3a7412f38f1991c20d7a2765f30c2bb9cb1e63e0/Levenshtein/_levenshtein.c#L5691-L5699 should be moved to be the first if statement (moved to line 5675).
Hope this helps
Can confirm that @BobLd's suggestion fixes the problem.
@BobLd this does not really fix the issue. It just shifts it, so it works for this explicit case. As an example
SequenceMatcher(None, "no", "bnonco").get_matching_blocks()
currently works, but breaks when moving the if statement to the top
As a note: This is not a bug, since python-Levenshtein does not provide any guarantee, that matching_blocks includes the longest common substring. It bases the matching blocks on one of possibly multiple optimal alignments. Finding this alignment would require a search through all optimal alignments, which is simply not done in the library. Note that this is stated in the documentation of opcodes:
The result is a list of 5-tuples with the same meaning as in SequenceMatcher's get_opcodes() output. But since the algorithms differ, the actual sequences from Levenshtein and SequenceMatcher may differ too
If I use this in python
list(SequenceMatcher(None, 'asfdsffdfd abcd', 'abcr').get_matching_blocks())
I get this result[(0, 0, 1), (12, 1, 2), (15, 4, 0)]
which is clearly wrong, because the longest match is (11, 0, 3), not (12, 1, 2)