weinman / cnn_lstm_ctc_ocr

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR
GNU General Public License v3.0

error with the mjsynth-tfrecord.py file #28

Closed kai-kaushik closed 5 years ago

kai-kaushik commented 6 years ago

I downloaded the mjsynth dataset separately and stored the images in the image subpath under the data directory. Basically, I did everything manually up until the "make mjsynth-tfrecord.py" command. When I ran the command, it reported a syntax error on the print line in this part of the mjsynth-tfrecord.py file:

        print str(i),'of',str(num_shards),'[',str(start),':',str(end),']',out_filename
        gen_shard(sess, input_base_dir, image_filenames[start:end], out_filename)
    # Clean up writing last shard
    start = num_shards*images_per_shard
    out_filename = output_filebase+'-'+(shard_format % num_shards)+'.tfrecord'
    print str(i),'of',str(num_shards),'[',str(start),':]',out_filename
    gen_shard(sess, input_base_dir, image_filenames[start:], out_filename)

Since I am using Python 3.6, I thought the problem was the missing parentheses around the print arguments, so I changed it to this:

        print (str(i),'of',str(num_shards),'[',str(start),':',str(end),']',out_filename)
        gen_shard(sess, input_base_dir, image_filenames[start:end], out_filename)
    # Clean up writing last shard
    start = num_shards*images_per_shard
    out_filename = output_filebase+'-'+(shard_format % num_shards)+'.tfrecord'
    print (str(i),'of',str(num_shards),'[',str(start),':]',out_filename)
    gen_shard(sess, input_base_dir, image_filenames[start:], out_filename)

The program then started running, but I'm seeing an error reported for a lot of the files it reads, coming from this block:

    except:
        # Some files have bogus payloads, catch and note the error, moving on
        print('ERROR',filename)

Can anyone tell me why this is happening? Thank you in advance for the help.

weinman commented 6 years ago

Thanks for the note! If you find enough other places that differ significantly for Python 3, I'd be happy to have a separate branch that contains updates for Python 3, and I'd merge it with master if it works in both Python 2 and Python 3.
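As a rough illustration of a print call that behaves the same under both interpreters: building the whole message as one string and passing that single argument to `print(...)` avoids the Python 2 behaviour where `print (a, b)` prints a tuple. The helper name `shard_message` and the sample values are hypothetical and not part of the repo; adding `from __future__ import print_function` at the top of the script is another common option.

```python
def shard_message(i, num_shards, start, end, out_filename):
    """Build the shard progress line as one string (hypothetical helper)."""
    # end=None stands for the open-ended slice used for the last shard.
    end_text = '' if end is None else str(end)
    return '{} of {} [{}:{}] {}'.format(i, num_shards, start, end_text, out_filename)


# A single parenthesised string prints identically on Python 2 and Python 3.
print(shard_message(0, 4, 0, 1000, 'words-000.tfrecord'))
print(shard_message(4, 4, 4000, None, 'words-004.tfrecord'))
```

The spacing of the message differs slightly from the comma-separated original; only the single-argument pattern is the point here.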

In any case, several of the jpg files in the raw mjsynth archive are just garbage. You can verify this by trying to load them in any image viewer (they might be truncated, but they tend to be only a handful of bytes relative to the valid images).
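For anyone who wants to screen out such files without an image viewer, here is a minimal sketch, not what the repo does: it rejects very small files and checks for the JPEG start-of-image (`FF D8`) and end-of-image (`FF D9`) markers. The function name `looks_like_jpeg` and the `min_bytes` threshold are assumptions made for illustration. Passing this check does not guarantee the file decodes, and a valid JPEG with trailing bytes after the end marker would be rejected.

```python
import os

JPEG_SOI = b'\xff\xd8'  # start-of-image marker
JPEG_EOI = b'\xff\xd9'  # end-of-image marker


def looks_like_jpeg(path, min_bytes=128):
    """Cheap screen: size threshold plus JPEG start/end markers.

    min_bytes is an arbitrary illustrative threshold, not a value from the repo.
    """
    # Require at least 4 bytes so both markers can be read safely.
    if os.path.getsize(path) < max(min_bytes, 4):
        return False
    with open(path, 'rb') as f:  # binary mode: compare raw bytes
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == JPEG_SOI and tail == JPEG_EOI
```

A caller could run this over the image list before sharding and log the names that fail, then let the decoder's exception handling deal with anything that slips through.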

I don't know why that is, but the way the TFRecord encoder handles this is to detect the exception inevitably thrown by the image file decoder and let you know about it whilst moving on to another example.
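The skip-and-report pattern described here can be sketched as follows. This is an illustration under assumptions, not the repo's code: `process_files` and the `load` callback are hypothetical names. Unlike a bare `except:`, it also prints the exception itself, which helps tell a handful of corrupt files apart from a systematic failure that affects every file.

```python
def process_files(filenames, load):
    """Apply load() to each file; report and skip the ones that raise."""
    good, bad = [], []
    for filename in filenames:
        try:
            good.append((filename, load(filename)))
        except Exception as err:  # broad on purpose, but keep the reason visible
            print('ERROR', filename, repr(err))
            bad.append(filename)
    return good, bad


def fake_load(name):
    """Stand-in loader for the demo: fails on names that start with 'bad'."""
    if name.startswith('bad'):
        raise ValueError('bogus payload')
    return len(name)


loaded, failed = process_files(['a.jpg', 'bad.jpg', 'c.jpg'], fake_load)
```

If every file ends up in the failed list with the same exception, the input data is probably fine and the reading code is the more likely culprit.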

kai-kaushik commented 6 years ago

Thanks for the prompt reply. The problem is that all I see are errors when reading the files. I'll try once in Python 2 and see if I get the same error.

Hust-ShuaiWang commented 5 years ago

Did you run this code on a Windows system?

weinman commented 5 years ago

@Hust-ShuaiWang I'm not sure whether you're asking me or @Kumara-Kaushik, but I can tell you that I did not run any of this repo on a Windows system. Are you suggesting that the problems with bad input data do not occur on a Windows-based file system?

Hust-ShuaiWang commented 5 years ago

I am asking @Kumara-Kaushik. I hit the same error when I ran this repo on a Windows system. The reason is that the file is stored in different formats on Windows and Linux, so you have to change the way you read it. Just change "with tf.gfile.FastGFile( filename, 'r' ) as f:" to "with tf.gfile.FastGFile( filename, 'rb' ) as f:" (line 133 in mjsynth-tfrecord.py). For the detailed cause, you will have to look up the relevant information yourself; it is not difficult.

weinman commented 5 years ago

@Hust-ShuaiWang Thanks for the report. I'll try out that change on Linux, and if it works there too, I will commit it.

david-morris commented 5 years ago

@weinman I can confirm @Hust-ShuaiWang's technique. I think it works because the 'b' signifies bytes, and the default file-opening mode is now text.
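The text-versus-bytes difference can be seen with Python 3's built-in `open`. This is only a sketch: it does not use `tf.gfile.FastGFile`, and it assumes that wrapper draws the same distinction between `'r'` and `'rb'`. The payload is a made-up byte string that starts like a JPEG, and the encodings are set explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Bytes that begin like a JPEG, followed by a CRLF pair to show newline handling.
payload = b'\xff\xd8\xff\xe0\r\n'

fd, path = tempfile.mkstemp(suffix='.jpg')
with os.fdopen(fd, 'wb') as f:
    f.write(payload)

# Binary mode returns the bytes exactly as stored.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode tries to decode the bytes; 0xff is never valid in UTF-8.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    text_error = None
except UnicodeDecodeError as err:
    text_error = err

# Even with an encoding that accepts every byte, text mode rewrites '\r\n' as '\n'.
with open(path, 'r', encoding='latin-1') as f:
    as_text = f.read()

os.remove(path)

print(raw == payload)            # binary read is lossless
print(type(text_error).__name__)
print(len(raw), len(as_text))    # text read lost the '\r'
```

Either effect is enough to corrupt image data before it reaches a decoder, which is consistent with every file being reported as an error.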

weinman commented 5 years ago

Updated in defd8ae