wincentbalin / pytesstrain

Python tools for Tesseract OCR training
https://pypi.org/project/pytesstrain/
Apache License 2.0

Feature Request: Do not include punctuation in wordlist #1

Closed Shreeshrii closed 4 years ago

Shreeshrii commented 4 years ago

The wordlists in tesseract-ocr/langdata do not contain punctuation marks (except for the hyphen, -).

create_dictdata creates a wordlist, but its entries still carry punctuation marks. Example:

Prasamar.
ṭyul(tana)
‘pertaining
to’;
Nandaka
auxiliary
deities.
Bhānugupta,

Please make the generated wordlists exclude punctuation marks. Thanks!

wincentbalin commented 4 years ago

Hello Shree, I changed the default behaviour of create_dictdata.py to remove punctuation.
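
For readers wondering what this kind of punctuation removal could look like, here is a minimal sketch. It is not the code in create_dictdata.py; the function names `strip_punctuation` and `build_wordlist` and the `KEEP` set are hypothetical, and keeping the hyphen is an assumption based on the langdata convention mentioned above. It drops every character whose Unicode general category is a punctuation class, so typographic quotes such as ‘ and ’ are handled as well as ASCII marks.

```python
import unicodedata

# Assumption: keep the hyphen, as the langdata wordlists do.
KEEP = {"-"}


def strip_punctuation(token, keep=KEEP):
    """Remove Unicode punctuation (categories P*) from a token, except kept characters."""
    return "".join(
        ch
        for ch in token
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


def build_wordlist(text):
    """Split text on whitespace, strip punctuation, and return unique words, sorted."""
    words = {strip_punctuation(token) for token in text.split()}
    words.discard("")  # tokens that consisted only of punctuation
    return sorted(words)


if __name__ == "__main__":
    sample = "Prasamar. \u2018pertaining to\u2019; Nandaka auxiliary deities. Bh\u0101nugupta,"
    for word in build_wordlist(sample):
        print(word)
```

Note that this sketch deletes punctuation inside a token rather than splitting on it, so a token like `ṭyul(tana)` becomes a single word; whether to split there instead is a design choice that depends on the corpus.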

Shreeshrii commented 4 years ago

Thank you so much for your prompt response. Working great!!