wincentbalin / pytesstrain

Python tools for Tesseract OCR training
https://pypi.org/project/pytesstrain/
Apache License 2.0

evaluate CER/WER of existing model on ground truth/training data #2

Closed: whisere closed this issue 2 years ago

whisere commented 2 years ago

Are there scripts or an easy way to evaluate the CER/WER of an existing model, e.g. eng.traineddata and other traineddata models, on user-created ground truth/training data for comparison? `language_metrics -l lang -w lang.wordlist --fonts Arial,Courier` only runs over images of random word sequences from a specific wordlist and fonts. Thanks.

wincentbalin commented 2 years ago

Most traineddata models contain a wordlist (as a DAWG file), which you can use with the language_metrics tool. Unpack the model with the combine_tessdata utility (part of Tesseract), then use dawg2wordlist (also part of Tesseract) to convert the DAWG file back into a wordlist.
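A minimal sketch of those two steps, driven from Python for convenience. The component file names are assumptions: LSTM models typically unpack to eng.lstm-unicharset and eng.lstm-word-dawg, while legacy models use eng.unicharset and eng.word-dawg.

```python
import subprocess

# Unpack eng.traineddata into its component files, using "eng." as the prefix.
subprocess.run(["combine_tessdata", "-u", "eng.traineddata", "eng."], check=True)

# Convert the word DAWG back into a plain wordlist; dawg2wordlist needs the
# unicharset that matches the DAWG. Names assume an LSTM model.
subprocess.run(
    ["dawg2wordlist", "eng.lstm-unicharset", "eng.lstm-word-dawg", "eng.wordlist"],
    check=True,
)
```

The resulting eng.wordlist can then be passed to language_metrics with `-w eng.wordlist`.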

Or replace the code in language_metrics that creates word sequences with code that reads a known ground-truth file used for training and selects a random line as the text to evaluate.
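As a sketch, assuming the ground truth is a plain text file with one sample per line, such a replacement could be as simple as the hypothetical function below (not part of pytesstrain):

```python
import random
from pathlib import Path

def random_ground_truth_line(gt_path):
    """Return a random non-empty line from a ground-truth text file.

    Hypothetical stand-in for the word-sequence generator inside
    language_metrics; the real integration point may differ.
    """
    lines = Path(gt_path).read_text(encoding="utf-8").splitlines()
    return random.choice([line.strip() for line in lines if line.strip()])
```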

Alternatively, if the model you would like to test is a legacy model, you might extract the bigram file, load it into the Markov Chain, and use the chain to generate text for language_metrics.
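For illustration only (this is not pytesstrain's actual Markov Chain interface), generating text from bigram counts could look like this, where `bigrams` is assumed to map (word1, word2) pairs to counts parsed from the extracted bigram file:

```python
import random

def generate_from_bigrams(bigrams, length=20):
    """Generate a word sequence from bigram counts via a simple Markov chain."""
    # Build a transition table: word -> list of (successor, weight) pairs.
    transitions = {}
    for (w1, w2), count in bigrams.items():
        transitions.setdefault(w1, []).append((w2, count))
    word = random.choice(list(transitions))
    words = [word]
    for _ in range(length - 1):
        successors = transitions.get(word)
        if not successors:
            break  # dead end: no bigram starts with this word
        nexts, weights = zip(*successors)
        word = random.choices(nexts, weights=weights, k=1)[0]
        words.append(word)
    return " ".join(words)
```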

whisere commented 2 years ago

Thanks! I also found that the lstmeval command works for evaluating models on existing ground truth, and it outputs CER and WER.
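For reference, a minimal invocation, again driven from Python (file names are placeholders; eval.list is assumed to list the .lstmf files built from the ground truth):

```python
import subprocess

# Evaluate a full traineddata model against the .lstmf files listed in
# eval.list; lstmeval prints character and word error rates when done.
subprocess.run(
    ["lstmeval", "--model", "eng.traineddata", "--eval_listfile", "eval.list"],
    check=True,
)
```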

wincentbalin commented 2 years ago

Then I consider this issue solved.