weinman / cnn_lstm_ctc_ocr

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR
GNU General Public License v3.0

Question Regarding End Model #66

Closed tschamp31 closed 4 years ago

tschamp31 commented 4 years ago

Preface: My background coming into machine learning is mainly just programming. I've gained some solid knowledge about data framing and the like, but the math of neural networks is still a major weakness.

So I believe I already know the answers to my questions, but I want to verify before I spend time on these efforts. 1.) The MJSynth dataset will only teach the model how to break down/identify/process single words?

2.) Assuming yes to (1); which means the model still needs to be taught how to read sentences and paragraph/spacing structure, correct?

3.) Assuming yes to (2); is that where your team's work on MapTextSynthesizer came into play?

4.) Assuming yes to (3); is the finished MJSynth model then trained on the MapTextSynthesizer dataset, or is it fine-tuned/scoped to that dataset?

5.) Assuming yes to (4); what global_step/loss/learning rate, etc., is ideal for training on that dataset?

Yes/no should suffice for all five; if no, a very short rationale would help. Thank you again for making this project public, and best wishes to your team at ICDAR 2019. I will also post a 1-million-step model trained on a single GPU, if your team would like a copy on hand or to provide to the public.

weinman commented 4 years ago
  1. Yes, the MJSynth dataset contains only images of single words.
  2. Yes; training on MJSynth alone will not work well for segmenting words (i.e., handling spaces).
  3. No. We created and used MapTextSynthesizer because the visual properties of MJSynth were not a good match for our application. By default, it also generates only images of single words, but with more complicated backgrounds and wider inter-character spacing (on average).
  4. For the results in our ICDAR'19 paper, we train the model from scratch solely on the MapTextSynthesizer stream.
  5. The training schedule we use is given in the paper (Table II) with average (per-word) loss on the real map and MJSynth data given in Figure 6. See #42 for some additional context/examples.

Indirectly, you could probably train a sequence recognizer with MapTextSynthesizer. You could generate a static list of captions (phrases) to sample from as if they were words (though I'm not sure whether the spaces would render properly; maybe @arthurhero knows), but the better thing to do would be to choose the random phrase dynamically on the fly, which would require more substantial modifications to the code.
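
For the static-list route, here's a minimal, untested sketch of what I mean; the helper, file name, and phrase lengths are purely illustrative and not part of either codebase:

import random

def make_phrase_list(words, n_phrases=10000, max_words=3, out_path="phrases.txt"):
    # Sample short multi-word "captions" from an existing word list so the
    # synthesizer can treat each phrase as if it were a single word.
    with open(out_path, "w") as f:
        for _ in range(n_phrases):
            k = random.randint(2, max_words)
            f.write(" ".join(random.choices(words, k=k)) + "\n")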

In either case, you could then use the CTCWordBeamSearch module in a multi-word mode to recognize the text (or plain Tensorflow CTC beam search if you don't want a lexicon). Just remember to include a space among the output characters in charset.py.
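
If you go the plain-TensorFlow route, here is a rough sketch (untested; the charset string and variable names are illustrative, and logits are assumed time-major as CTC expects):

import tensorflow as tf

# charset.py: the output alphabet must include a space for multi-word text
out_charset = "abcdefghijklmnopqrstuvwxyz0123456789 "  # note the trailing space

def ctc_decode(logits, seq_lens, beam_width=128):
    # logits: [max_time, batch, len(out_charset)+1]; the last class is the CTC blank
    decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_lens,
                                               beam_width=beam_width)
    dense = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()
    return ["".join(out_charset[i] for i in row if i >= 0) for row in dense]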

arthurhero commented 4 years ago

The spacing should be fine. But since phrases tend to be longer than single words, pay attention to the hard upper limit on the image width, which can be set in mts_texthelper.cpp at line 562:

// Rendered surface width is capped at 40x the line height
surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 40*height, height);

Currently the hard limit is 40 times the image height; you might want to set it higher for phrases.

tschamp31 commented 4 years ago

Perfect, that feedback was exactly what I needed. Thank you both. I will continue updating the code to cleaner TF 2.0, ideally getting rid of all the "tf.compat.vX" calls.
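
For example, one of the typical changes (hypothetical snippet; the Dense layer just stands in for the actual CNN+LSTM network):

import tensorflow as tf

# Stand-in for the real network (illustrative only)
net = tf.keras.layers.Dense(8)

# TF1 style being removed:
#   x = tf.compat.v1.placeholder(tf.float32, [None, 32, None, 1])
#   with tf.compat.v1.Session() as sess:
#       out = sess.run(net(x), feed_dict={x: batch})

# TF2 style: eager by default; tf.function gives a compiled graph when needed
@tf.function(input_signature=[tf.TensorSpec([None, 32, None, 1], tf.float32)])
def infer(images):
    return net(images)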

As I said before, thank you again for making this project public.