weinman / cnn_lstm_ctc_ocr

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR
GNU General Public License v3.0

Train on more characters? #17

Closed hiepph closed 6 years ago

hiepph commented 6 years ago

I want to recognize more than just the English alphabet and digits (e.g. special Unicode characters). Is this possible, and if so, how can I do it?

Suppose I have my own dataset. Do I have to write my own data loader and provide

out_charset="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

like in your src/mjsynth.py?

weinman commented 6 years ago

The model itself doesn't depend on any particular alphabet. You simply need a way to map the label characters to a set of consecutive non-negative integers (that's what the CTC layer in TensorFlow expects).

As you noted, I do this in src/mjsynth.py by constructing a single string and using each character's index in that string as its label. For other Unicode characters, you'd want to make sure that string.index, as used in the data generator, works for them.
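
To make the mapping concrete, here is a minimal Python 3 sketch of the idea. It is not the code from src/mjsynth.py. The extra characters appended to out_charset, the helper names encode_label and decode_label, and the NFC normalization step are all assumptions added for illustration. The sketch also assumes that one extra class is reserved for the CTC blank, so check that against the loss function you actually call.

```python
import unicodedata

# Hypothetical extended charset: the 62 characters from the question plus a
# few illustrative extras. NFC normalization keeps each accented letter as a
# single code point, so one visible character maps to exactly one label.
out_charset = unicodedata.normalize(
    "NFC",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "éñüß€",
)

# Every character must be unique, otherwise two labels would share a glyph.
assert len(set(out_charset)) == len(out_charset)

# Dictionary lookup gives the same result as out_charset.index(ch),
# without rescanning the string for every character.
char_to_index = {ch: i for i, ch in enumerate(out_charset)}


def encode_label(text):
    """Map a text label to a list of consecutive integer class ids."""
    text = unicodedata.normalize("NFC", text)
    # Raises KeyError for characters outside the charset, which surfaces
    # unsupported symbols in the dataset early.
    return [char_to_index[ch] for ch in text]


def decode_label(indices):
    """Map a list of class ids back to text."""
    return "".join(out_charset[i] for i in indices)


# Assumption: the CTC loss needs one additional class for the blank symbol.
num_classes = len(out_charset) + 1
```

Calling encode_label on a word from your own dataset returns the integer sequence to store as the label, and decode_label turns predicted ids back into text. If your data is read as bytes rather than as text, decode it (for example from UTF-8) before indexing, since indexing into a byte string splits multi-byte characters.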