Closed wkhunter closed 6 years ago
Tesla P40, 20 GB. Training speed is very quick, but validation is very slow.
validate.py was mostly intended as a quick way to see what the model is producing as output, beyond the summary statistics produced by test.py. In particular, a feed_dict mechanism is used, so one should not expect high throughput (though that doesn't seem to be your problem).
By 5000+ chars, do you mean that's how many output classes there are for CTC? That's a huge number, and I don't have any experience with models that wide.
Looking at the code, there are two possible suggestions I can make:
I noticed that the image preprocessing happens on the CPU. Move the `with tf.device(FLAGS.device):` block above the assignments to `proc_image` to include those operations on the GPU.
The beam width of the CTC beam search might be too large for a model with 5000 output classes and a sequence as long as yours (roughly 280/2 = 140 timesteps). Make the `beam_width` parameter to `tf.nn.ctc_beam_search_decoder` smaller (the hard-coded value is 128).
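To see why a wide beam is expensive here, a toy prefix beam search can be sketched in plain Python. This is only an illustration of the cost, not the actual `tf.nn.ctc_beam_search_decoder` algorithm (it keeps a single best score per prefix and ignores the blank-separated-repeat case for brevity); the function name is hypothetical. Each timestep scores every (beam, class) pair, so work grows roughly as timesteps × beam_width × num_classes, which with 140 timesteps, 5000 classes, and a beam of 128 is very large.

```python
def toy_ctc_beam_search(log_probs, beam_width):
    """Toy prefix beam search over per-timestep log-probabilities.

    log_probs: list of T lists, each of length num_classes; the last
    class index is treated as the CTC blank. Simplified sketch only:
    keeps one best score per prefix and collapses adjacent repeats.
    """
    blank = len(log_probs[0]) - 1
    beams = {(): 0.0}  # prefix tuple -> best log-probability so far
    for step in log_probs:
        candidates = {}
        # Inner loops touch every (beam, class) pair: the cost driver.
        for prefix, score in beams.items():
            for cls, lp in enumerate(step):
                if cls == blank or (prefix and prefix[-1] == cls):
                    new_prefix = prefix  # blank or repeated label: collapse
                else:
                    new_prefix = prefix + (cls,)
                new_score = score + lp
                if candidates.get(new_prefix, float("-inf")) < new_score:
                    candidates[new_prefix] = new_score
        # Prune to the beam_width highest-scoring prefixes.
        beams = dict(sorted(candidates.items(), key=lambda kv: -kv[1])[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1])[0]
    return list(best)
```

With `beam_width=1` this degenerates to following the single best prefix, which is why a width of 1 behaves like a greedy search.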
You probably also want to run a Python profiler, and perhaps the TensorFlow profiler, to see exactly where the slowdown is.
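A minimal way to profile from Python is the standard-library `cProfile` module. The helper below (the name `profile_call` is my own, not anything in this repository) wraps a slow call, such as the `session.run` that drives the decoder in validate.py, and returns the result along with a report of where cumulative time was spent:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, text report of top functions)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    # Render the ten most expensive entries by cumulative time.
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()
```

For example, `profile_call(sum, range(100))` returns `4950` plus a report; in validate.py you would wrap the per-image inference call instead.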
I'll be glad to hear of any results you find.
I have tried your suggestions, and yes, I have 5000+ classes, which is very large.
1. Moved the `with tf.device(FLAGS.device):` above the assignments to `proc_image` to include those on the GPU.
2. Set `beam_width=1, # 128` and `merge_repeated=False`.
It was the CTC decoding that was slow; now one picture needs less than 1 s, which is very quick. Now I will try to find a way to improve the accuracy. Thank you.
Thanks for reporting back.
With `beam_width=1`, you've basically got a greedy search, and you may as well use `tf.nn.ctc_greedy_decoder`. You should use the highest `beam_width` you can tolerate for the best accuracy.
@wkhunter Could you please specify the version of tensorflow and tensorflow-gpu you used to train?
I have been using tensorflow=1.14.0 and tensorflow-gpu=1.14.0, but nvidia-smi shows only 56 MB of the 12 GB on my GPU being utilised. I have a Tesla K80 (12 GB).
@SnehalRaj if you're asking about testing, most of the time will be taken up by running the decoder on the logits produced. However, I'm fairly certain TensorFlow doesn't release the memory it's allocated once you invoke the model for a forward pass in testing. (And I've never seen it that low in training either.)
Most recently I'd been running training on a local compile of TF 1.12.0 with CUDA 9.2 and cuDNN 7.1.4.
Thanks @weinman for the info.
After about a day of struggle, I finally got it to work. The problem was that my device was configured for Python 3, and since your code works with Python 2, the tensorflow-gpu, CUDA, and cuDNN versions weren't compatible.
Also, for anyone in the future, I would highly recommend creating a conda virtual environment with Python 2.7 and then installing tensorflow-gpu 1.12.0 using the commands:
conda create -n yourenvname python=2.7
conda install -c anaconda tensorflow-gpu==1.12.0
Conda will do the rest, and everything works seamlessly.
Thanks for the summary and conclusion. I'll emphasize Python 2.7 in the README.md (also: note again a call for PRs with any necessary updates for Python 3).
One picture takes 30 seconds with validate.py. Pictures are around 32x280, 5000+ chars, 200 MB model size. How can I speed this up? 30 seconds is too long.