wincentbalin / pytesstrain

Python tools for Tesseract OCR training
https://pypi.org/project/pytesstrain/
Apache License 2.0

metrics: avoid CER > 1.0 #3

Open bertsky opened 2 years ago

bertsky commented 2 years ago

https://github.com/wincentbalin/pytesstrain/blob/b6a85dec3a02b878f8cee7d8170a75e7dabaeca6/pytesstrain/metrics/cer.py#L6

This definition is common but flawed, IMHO. The numerator is a Levenshtein distance, i.e. a sum of costs along a path through the alignment matrix, so the natural denominator is the length of that path. With the ground-truth length as denominator, the rate exceeds 1.0 whenever the distance is larger than the ground-truth length, e.g. when the OCR output contains many insertions. (Of course, the editdistance package does not yield the actual alignment path, so you'll have to use a different library, like difflib.SequenceMatcher or rapidfuzz.levenshtein_editops.)
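To illustrate the point, here is a minimal pure-Python sketch (not the code in `cer.py`, and not using any of the libraries named above). The function names are hypothetical. It assumes unit costs, and among equally cheap alignments it picks the shortest path, which is an arbitrary tie-break.

```python
from typing import Tuple


def levenshtein_with_path_length(ref: str, hyp: str) -> Tuple[int, int]:
    """Return (edit distance, alignment path length) for ref vs. hyp.

    Hypothetical helper. Each cell holds (cost, path length); tuples
    compare by cost first, so ties in cost go to the shorter path
    (an assumed, arbitrary tie-break).
    """
    # Row for the empty reference prefix: j insertions, path length j.
    prev = [(j, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [(i, i)]  # empty hypothesis prefix: i deletions
        for j, h in enumerate(hyp, start=1):
            diag_cost, diag_len = prev[j - 1]
            up_cost, up_len = prev[j]
            left_cost, left_len = curr[j - 1]
            curr.append(min(
                (diag_cost + (r != h), diag_len + 1),  # match or substitution
                (up_cost + 1, up_len + 1),             # deletion
                (left_cost + 1, left_len + 1),         # insertion
            ))
        prev = curr
    return prev[-1]


def cer_gt(ref: str, hyp: str) -> float:
    """Distance divided by ground-truth length (can exceed 1.0)."""
    distance, _ = levenshtein_with_path_length(ref, hyp)
    return distance / len(ref)  # assumes a non-empty reference


def cer_pathlen(ref: str, hyp: str) -> float:
    """Distance divided by alignment path length (stays within [0, 1])."""
    distance, path_len = levenshtein_with_path_length(ref, hyp)
    return distance / path_len if path_len else 0.0


print(cer_gt("ab", "abcde"))       # 3 insertions / 2 reference characters
print(cer_pathlen("ab", "abcde"))  # 3 insertions / 5 alignment steps
```

Since every step on the path costs at most 1, the path-length variant cannot exceed 1.0.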

For some discussion, see here and here.

Perhaps the different definitions (gt-ref / max-ref / pathlen) could be made optional?
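One possible shape for such an option, as a rough sketch only: the function name, the parameter names, and the reading of the three variants (ground-truth length, maximum of both lengths, alignment path length) are my assumptions, not existing pytesstrain API. The distance and path length would come from whatever alignment library is used.

```python
from typing import Optional


def normalize_cer(distance: int, ref_len: int, hyp_len: int,
                  path_len: Optional[int] = None,
                  normalization: str = "gt") -> float:
    """Divide an edit distance by the selected denominator (hypothetical API)."""
    if normalization == "gt":
        denominator = ref_len                # ground-truth (reference) length
    elif normalization == "max":
        denominator = max(ref_len, hyp_len)  # longer of the two strings
    elif normalization == "pathlen":
        if path_len is None:
            raise ValueError("path_len is required for normalization='pathlen'")
        denominator = path_len               # length of the alignment path
    else:
        raise ValueError(f"unknown normalization: {normalization!r}")
    if denominator == 0:
        # Assumed convention: no errors on empty input gives 0.0,
        # errors against an empty denominator give infinity.
        return 0.0 if distance == 0 else float("inf")
    return distance / denominator


# Reference "ab" vs. OCR output "abcde": distance 3, path length 5.
print(normalize_cer(3, 2, 5))                                      # gt
print(normalize_cer(3, 2, 5, normalization="max"))                 # max
print(normalize_cer(3, 2, 5, path_len=5, normalization="pathlen")) # pathlen
```

Keeping `"gt"` as the default would leave existing results unchanged for current users.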

wincentbalin commented 2 years ago

As I do not have much time to solve this, would you like to contribute a solution?

bertsky commented 2 years ago

I would indeed – just give me some time.