weinman / cnn_lstm_ctc_ocr

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR
GNU General Public License v3.0

Model learns nothing about certain characters. #59

Closed: keeofkoo closed this issue 4 years ago

keeofkoo commented 5 years ago

I was planning to adapt the architecture to recognize a large character set, like Japanese and Chinese, but found that the model does not learn anything about a certain set of characters, some of which are even among the most frequently used. I trained the model with a dataset of 120k+ cropped words (covering roughly 3k characters) for 500k steps and got a loss around 5. I have checked that the cannot-be-learned characters are indeed present in both the training and validation datasets. I printed out the intermediate logits for debugging, only to find that they are all the same value (like [3.2362458e-4, 3.2362458e-4, ... 3.2362458e-4]), meaning the model has no clue which class the input should fall into. Roughly 100 characters fall into this case, while the rest (the majority) seem to be fine. I have also referred to #42 and tried following the training schedule there, but all I got was a new set of cannot-be-learned characters.
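
For reference, one way to enumerate such "dead" classes is to take the peak softmax probability per class over a validation batch: a class the model has learned nothing about never rises above chance anywhere. This is a minimal NumPy sketch, assuming frame-wise logits of shape [time, batch, num_classes+1] with the CTC blank as the final class (the shapes, names, and threshold are assumptions, not repo code):

```python
import numpy as np

def dead_classes(logits, threshold=1e-3):
    """logits: array [time, batch, num_classes+1] fetched from the network,
    with the CTC blank as the final class. Returns indices of classes whose
    softmax probability never exceeds `threshold` in any frame."""
    # Numerically stable softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    peak = probs.max(axis=(0, 1))  # best score per class over all frames
    return np.nonzero(peak[:-1] < threshold)[0]  # drop the blank class
```

With ~3k classes, a uniform (uninformative) output sits near 3.3e-4, so a threshold of 1e-3 flags classes that never climb meaningfully above chance.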

weinman commented 5 years ago

This isn't an error or bug in the code (and hence not cause for an issue), but rather behavior you're experiencing in a particular use case. You should open a thread on Stack Overflow about training such models in those cases.

It can be hard for the LSTM to get "off the ground". You might try pre-training a simple CNN classifier and importing those weights into the full model.
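
If it helps, here is a minimal sketch of that idea in TF1-style TensorFlow (the repo's framework). The scope name `convnet`, the layer shapes, and the checkpoint path are all assumptions for illustration, not the repo's actual model definition; the key point is building both graphs with the same conv code under the same variable scope so a scoped `tf.train.Saver` can carry the weights across:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

NUM_CLASSES = 3000  # rough size of the character set described above

def conv_features(images):
    """Shared conv stack; the same code must build both graphs so that
    variable names line up for checkpoint restore."""
    with tf.variable_scope('convnet'):
        x = tf.layers.conv2d(images, 64, 3, padding='same',
                             activation=tf.nn.relu, name='conv1')
        x = tf.layers.max_pooling2d(x, 2, 2)
        x = tf.layers.conv2d(x, 128, 3, padding='same',
                             activation=tf.nn.relu, name='conv2')
        x = tf.layers.max_pooling2d(x, 2, 2)
    return x

# --- Stage 1: pre-train a plain character classifier on single-char crops ---
images = tf.placeholder(tf.float32, [None, 32, 32, 1])
labels = tf.placeholder(tf.int32, [None])
feats = conv_features(images)
logits = tf.layers.dense(tf.layers.flatten(feats), NUM_CLASSES, name='char_head')
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

# Save only the conv weights; the classifier head is discarded.
conv_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='convnet')
saver = tf.train.Saver(var_list=conv_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... run training steps on single-character crops here ...
    saver.save(sess, 'pretrain/convnet.ckpt')

# --- Stage 2: after building the full CNN+LSTM+CTC graph with the same
# conv_features() under the same scope, restore the warm start:
#   restorer = tf.train.Saver(var_list=tf.get_collection(
#       tf.GraphKeys.GLOBAL_VARIABLES, scope='convnet'))
#   restorer.restore(sess, 'pretrain/convnet.ckpt')
# and then continue training the whole model with CTC loss.
```

Starting the CTC training from conv features that already discriminate individual characters gives the LSTM a much easier credit-assignment problem than learning both from scratch.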