yuantailing / ctw-baseline

Baseline methods for [CTW dataset](https://ctwdataset.github.io/)
MIT License

Very slow training speed. Is this expected for my system setup? #34

Open mrxiaohe opened 5 years ago

mrxiaohe commented 5 years ago

I am currently training a model using the Chinese in the Wild image data. My system setup is as follows:

The speed is shown below: each step takes close to 30 seconds. The training has been running for 2 days and has only completed 5410 steps so far. The GPU does seem to be in use -- 96% of its memory is allocated. The CPU also shows quite a bit of activity, e.g., about 40% for the Python session in which the training is running.

Also, when I started training, I got the message `failed to allocate 15.90G (17071144960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY`. Not sure if this is related.

So my question is whether the speed I am observing is normal for this kind of setup, and how I might improve it. Thanks!

INFO:tensorflow:Recording summary at step 5315.
INFO:tensorflow:Saving checkpoint to path E:\Projects\TEXT DETECTION\chinese_text_in_the_wild\ctw-baseline-master\classification\products\train_logs_alexnet_v2\model.ckpt
INFO:tensorflow:Recording summary at step 5319.
INFO:tensorflow:global step 5320: loss = 6.8224 (29.31 sec/step)
INFO:tensorflow:Recording summary at step 5323.
INFO:tensorflow:Recording summary at step 5327.
INFO:tensorflow:global step 5330: loss = 6.9395 (29.13 sec/step)
INFO:tensorflow:Recording summary at step 5331.
INFO:tensorflow:Recording summary at step 5335.
INFO:tensorflow:Recording summary at step 5339.
INFO:tensorflow:global step 5340: loss = 6.7953 (34.16 sec/step)
INFO:tensorflow:Recording summary at step 5343.
INFO:tensorflow:Recording summary at step 5347.
INFO:tensorflow:global step 5350: loss = 6.8213 (30.08 sec/step)
INFO:tensorflow:Recording summary at step 5351.
INFO:tensorflow:Recording summary at step 5355.
INFO:tensorflow:Saving checkpoint to path E:\Projects\TEXT DETECTION\chinese_text_in_the_wild\ctw-baseline-master\classification\products\train_logs_alexnet_v2\model.ckpt
INFO:tensorflow:Recording summary at step 5359.
INFO:tensorflow:global step 5360: loss = 6.8168 (29.48 sec/step)
INFO:tensorflow:Recording summary at step 5363.
INFO:tensorflow:Recording summary at step 5367.
INFO:tensorflow:global step 5370: loss = 6.8478 (29.09 sec/step)
INFO:tensorflow:Recording summary at step 5371.
INFO:tensorflow:Recording summary at step 5375.
INFO:tensorflow:Recording summary at step 5376.
INFO:tensorflow:global step 5380: loss = 6.8576 (30.47 sec/step)
INFO:tensorflow:Recording summary at step 5380.
INFO:tensorflow:Recording summary at step 5384.
INFO:tensorflow:Recording summary at step 5388.
INFO:tensorflow:global step 5390: loss = 6.8722 (30.95 sec/step)
INFO:tensorflow:Recording summary at step 5392.
yuantailing commented 5 years ago

The AlexNet v2 model is relatively small; it runs at about 0.2 sec/step on an NVIDIA GTX TITAN X.

I have never seen training this slow. Are you running other GPU programs concurrently? A restart might help.

You can set the summary interval to a very large number (e.g., 999999) to effectively disable summary writing; that may help:

https://github.com/yuantailing/ctw-baseline/blob/081c8361b0ed0675bf619ab6316ebf8415e93f46/classification/train.py#L30
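For example, if the flag in question is TF-Slim's standard `save_summaries_secs` (an assumption -- check the linked line for the actual name used there), the change would look roughly like the sketch below:

```python
import tensorflow as tf
slim = tf.contrib.slim

# Sketch only (TF 1.x / TF-Slim), assuming the repo's training ultimately
# goes through slim.learning.train; train_op and logdir come from the
# repo's own setup and are just parameters here.
def train(train_op, logdir):
    slim.learning.train(
        train_op,
        logdir,
        log_every_n_steps=10,
        save_summaries_secs=999999,  # write summaries (almost) never
    )
```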

In my experience, the CUDA_ERROR_OUT_OF_MEMORY warning is harmless -- TensorFlow picks a memory-hungry algorithm when a lot of GPU memory is available, but it works just as well with less. The NVIDIA GTX TITAN X in fact has only 12 GB of memory.
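If the up-front allocation itself bothers you, a general TensorFlow 1.x option (not something this thread says the baseline sets) is to let the session grow its GPU memory on demand instead of reserving nearly all of it at startup:

```python
import tensorflow as tf

# General TF 1.x session options, not specific to this repo:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.8  # or cap the fraction

sess = tf.Session(config=config)
```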

mrxiaohe commented 5 years ago

Thanks for the response! I don't have any other GPU programs running concurrently. What would be the best way to check whether the GPU is actually being used? I pip-installed a Python module called GPUtil, which shows that 96% of GPU memory is in use, but it doesn't seem to say whether the GPU is actually doing any computation:

[image: GPUtil output]
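For reference, GPUtil's standard API does report compute load separately from memory usage; a minimal check would look roughly like this:

```python
import GPUtil

# GPUtil reports compute load and memory usage separately;
# the 96% figure above is memory, not compute.
for gpu in GPUtil.getGPUs():
    print('GPU %d (%s): load %.0f%%, memory %.0f%% (%.0f / %.0f MB)' % (
        gpu.id, gpu.name,
        gpu.load * 100,        # compute utilization (fraction 0-1)
        gpu.memoryUtil * 100,  # memory utilization (fraction 0-1)
        gpu.memoryUsed, gpu.memoryTotal))
```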

yuantailing commented 5 years ago

On Linux, I run nvidia-smi, but I don't know how to do this on Windows.

nvidia-smi

Your GPU Util should be very low -- the GPU can do each step's computation in about 0.2 sec, but a step takes 30 sec. So for roughly 29 of every 30 seconds GPU Util will read 0, and only in the remaining second will it be non-zero. It seems most of the time is spent preparing data and transferring it to GPU memory.
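One way to watch this over time (assuming nvidia-smi is on the PATH; on Windows it usually ships with the driver) is to poll it once per second:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -l 1

If utilization.gpu stays near 0 between brief spikes, that points to the data pipeline, not the GPU, as the bottleneck.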

mrxiaohe commented 5 years ago

Thanks for following up so quickly! I just ran nvidia-smi on my Windows machine. It looks like the Python process running the training uses almost all of the GPU memory, but Volatile GPU-Util is 0:

[image: nvidia-smi output]