Closed wkhunter closed 6 years ago
Tesla P40, 20 GB. Training speed is very quick, but validation is very slow.
validate.py was mostly intended as a quick way to see what the model is producing as output, beyond the summary statistics produced by test.py. In particular, a feed_dict mechanism is used, so one should not expect high throughput (though that doesn't seem to be your problem).
By 5000+ chars, do you mean that's how many output classes there are for CTC? That's a huge number, and I don't have any experience with models that wide.
Looking at the code, there are two possible suggestions I can make:
I noticed that the image preprocessing happens on the CPU. Move the `with tf.device(FLAGS.device):` block above the assignments to `proc_image` to include those operations on the GPU.
The beam width of the CTC beam search might be too large for a model with 5000 output classes and a sequence as long as yours (roughly 280/2 = 140 timesteps). Make the `beam_width` parameter to `tf.nn.ctc_beam_search_decoder` smaller (the hard-coded value is 128).
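To see why a wide beam is expensive here, a toy prefix beam search can be sketched in plain Python. This is only an illustration of the cost, not the actual `tf.nn.ctc_beam_search_decoder` algorithm (it keeps a single best score per prefix and ignores the blank-separated-repeat case for brevity); the function name is hypothetical. Each timestep scores every (beam, class) pair, so work grows roughly as timesteps × beam_width × num_classes, which with 140 timesteps, 5000 classes, and a beam of 128 is very large.

```python
def toy_ctc_beam_search(log_probs, beam_width):
    """Toy prefix beam search over per-timestep log-probabilities.

    log_probs: list of T lists, each of length num_classes; the last
    class index is treated as the CTC blank. Simplified sketch only:
    keeps one best score per prefix and collapses adjacent repeats.
    """
    blank = len(log_probs[0]) - 1
    beams = {(): 0.0}  # prefix tuple -> best log-probability so far
    for step in log_probs:
        candidates = {}
        # Inner loops touch every (beam, class) pair: the cost driver.
        for prefix, score in beams.items():
            for cls, lp in enumerate(step):
                if cls == blank or (prefix and prefix[-1] == cls):
                    new_prefix = prefix  # blank or repeated label: collapse
                else:
                    new_prefix = prefix + (cls,)
                new_score = score + lp
                if candidates.get(new_prefix, float("-inf")) < new_score:
                    candidates[new_prefix] = new_score
        # Prune to the beam_width highest-scoring prefixes.
        beams = dict(sorted(candidates.items(), key=lambda kv: -kv[1])[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1])[0]
    return list(best)
```

With `beam_width=1` this degenerates to following the single best prefix, which is why a width of 1 behaves like a greedy search.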
You probably also want to run a Python profiler, and perhaps the TensorFlow profiler, to see exactly where the slowdown is.
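A minimal way to profile from Python is the standard-library `cProfile` module. The helper below (the name `profile_call` is my own, not anything in this repository) wraps a slow call, such as the `session.run` that drives the decoder in validate.py, and returns the result along with a report of where cumulative time was spent:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, text report of top functions)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    # Render the ten most expensive entries by cumulative time.
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()
```

For example, `profile_call(sum, range(100))` returns `4950` plus a report; in validate.py you would wrap the per-image inference call instead.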
I'll be glad to hear of any results you find.
I have tried your suggestions, and yes, I have 5000+ classes, which is very large.
1. Moved the `with tf.device(FLAGS.device):` above the assignments to `proc_image` to include those on the GPU.
2. Set `beam_width=1, # 128` and `merge_repeated=False`.
It was the CTC decoding that was slow; now one picture needs less than 1 s, which is very quick. Now I will try to find a way to improve the accuracy. Thank you.
Thanks for reporting back.
With `beam_width=1`, you've basically got a greedy search, and you may as well use `tf.nn.ctc_greedy_decoder`. You should use the highest `beam_width` you can tolerate for the best accuracy.
@wkhunter Could you please specify the version of tensorflow and tensorflow-gpu you used to train?
I have been using tensorflow=1.14.0 and tensorflow-gpu=1.14.0, but nvidia-smi shows only 56 MB of the 12 GB on my GPU being utilised. I have a Tesla K80 (12 GB).
@SnehalRaj if you're asking about testing, most of the time will be taken up by running the decoder on the logits produced. However, I'm fairly certain TensorFlow doesn't release the memory it's allocated once you invoke the model for a forward pass in testing. (And I've never seen it that low in training either.)
Most recently I'd been running training on a local compile of TF 1.12.0 with CUDA 9.2 and cuDNN 7.1.4.
Thanks @weinman for the info.
After about a day of struggle, I finally got it to work. The problem was that my device was configured for Python 3, and since your code works with Python 2, the tensorflow-gpu, CUDA, and cuDNN versions weren't compatible.
Also, for anyone in the future, I would highly recommend creating a conda virtual environment with Python 2.7 and then installing tensorflow-gpu 1.12.0 using the commands:
conda create -n yourenvname python=2.7
conda install -c anaconda tensorflow-gpu==1.12.0
Conda will do the rest, and everything works seamlessly.
Thanks for the summary and conclusion. I'll emphasize Python 2.7 in the README.md (also: note again a call for PRs with any necessary updates for Python 3).
One picture takes 30 seconds with validate.py. Pictures are around 32x280, 5000+ chars, 200 MB model size. How can I speed this up? 30 seconds is too long.