ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0
1.26k stars 155 forks source link

editops missing replace operations in some cases #73

Closed danylofitel closed 3 years ago

danylofitel commented 3 years ago

Sample strings: TCTTTGGAGCACAAAACCAGTTGAAACATCAAATTCGTTTGATGTACTGAAGTCAGAGGACGCGCAGGGA TCTTTGGAGCACAAAACCAGTTGAAACATCATTATTCCTTCGTTTGATGTACTGAAGTCAGAGGACGCGCAGGGA

In this case TTATT was inserted and AA was replaced with CC, but editops only returns one replace of C to A.

Expected output: [('insert', 31, 31), ('insert', 31, 32), ('insert', 32, 34), ('insert', 32, 35), ('insert', 32, 36), ('replace', 32, 37), ('replace', 33, 38)]

Actual output: [('insert', 31, 31), ('insert', 31, 32), ('insert', 32, 34), ('insert', 32, 35), ('insert', 32, 36), ('replace', 32, 37)]

maxbachmann commented 3 years ago

I do not see anything incorrect in this output. The Levenshtein distance between these two strings is 6 (similar to the count of editops). Why should we insert an A and then replace an A. The editops describe the following transformation:

AA
TTATTCC

2 insertions in the beginning

TTAA
TTATTCC

3 Insertions after the first A

TTATTCA
TTATTCC

Replacement of the last A

TTATTCC
TTATTCC

Note that editops will just return one of multiple optimal paths.

danylofitel commented 3 years ago

My bad, you're right, the output is completely correct. I was actually looking at the output formatted differently (nothing to do with the library, custom code), the bug was actually in that formatting. Thanks for checking this! I'll close the issue.