Open mrxiaohe opened 5 years ago
The AlexNet v2 model is relatively small. It runs at 0.2 sec/step on an NVIDIA GTX TITAN X.
I have never seen training this slow. Are you running other GPU programs at the same time? A restart might help.
You can also set the summary interval to a very large number (e.g., 999999) to effectively turn summaries off. That may help.
In my experience, the CUDA_ERROR_OUT_OF_MEMORY warning is harmless -- TensorFlow uses a more memory-hungry algorithm when a lot of GPU memory is available, but it also works well with less memory. In fact, the NVIDIA GTX TITAN X has only 12 GB of memory.
Thanks for the response! I don't have any other GPU programs running concurrently. What would be the best way to check whether the GPU is actually being used? I pip installed a Python module called GPUtil, which shows that 96% of GPU memory is in use, but it doesn't seem to say whether the GPU is actually doing any computation:
On Linux, I run nvidia-smi, but I don't know how to do this on Windows.
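For what it's worth, nvidia-smi can also be polled from Python, which may be handy when the command line is awkward to use. The sketch below is only an illustration: it assumes the nvidia-smi executable can be found (on Windows it may not be on PATH, in which case the full path would have to be passed as exe), and the helper names query_gpus and parse_gpu_line are made up for this example.

```python
import subprocess

# Fields requested from nvidia-smi: compute utilization and memory usage.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV row such as '0, 11790, 12288' into a dict of ints.

    Assumes every field is numeric; some GPUs report '[Not Supported]'
    or '[N/A]' for a field, which would raise ValueError here.
    """
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpus(exe="nvidia-smi"):
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.check_output(
        [exe, "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]

# Example (requires an NVIDIA driver and nvidia-smi):
#     for gpu in query_gpus():
#         print(gpu)
```

Calling query_gpus() every second or so during training should show whether util_percent ever rises above 0 while memory stays nearly full.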
Your GPU Util should be very low -- the GPU can do the computation for each step in 0.2 sec, but each step takes 30 sec. So for roughly 29 of every 30 seconds you will see GPU Util at 0, and only in the remaining second or so will it be nonzero. It seems most of the time is spent preparing data and transferring it to GPU memory.
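One way to check where the time goes is to time the data-preparation and compute parts of a step separately. The sketch below assumes the training loop can be split into two callables, load_batch and run_step; both are hypothetical stand-ins rather than any TensorFlow API, and with a queue-based input pipeline the split may not be this clean.

```python
import time


def time_steps(load_batch, run_step, num_steps=10):
    """Return (load_seconds, compute_seconds) summed over num_steps.

    load_batch() -> batch and run_step(batch) are hypothetical callables
    standing in for the input pipeline and one training step.
    """
    load_s = 0.0
    compute_s = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = load_batch()
        t1 = time.perf_counter()
        run_step(batch)
        t2 = time.perf_counter()
        load_s += t1 - t0
        compute_s += t2 - t1
    return load_s, compute_s


def busy_fraction(compute_seconds, step_seconds):
    """Fraction of each step the GPU would spend computing."""
    return compute_seconds / step_seconds


# With the numbers from this thread: 0.2 sec of compute in a 30 sec step.
print("GPU busy: {:.2%}".format(busy_fraction(0.2, 30.0)))
```

If load_seconds dominates, the bottleneck is the input side (disk reads, decoding, preprocessing, host-to-GPU copies) rather than the model itself.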
Thanks for following up so quickly! I just ran nvidia-smi on my Windows machine. It looks like Python (the session running the training) uses almost all the GPU memory, but volatile GPU utilization is 0:
I am currently training a model using the Chinese in the Wild image data. My system setup is as follows:
The speed is shown below: each step takes close to 30 seconds. Training has been running for 2 days and has completed only 5410 steps so far. It seems like the GPU is being used -- 96% of the GPU memory is in use. The CPU also shows quite a bit of activity, e.g., 40% from the Python session in which the training is running.
Also, when I started training, I got the message: failed to allocate 15.90G (17071144960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY. Not sure if this is related.
So my question is whether the speed I am observing is normal for the kind of computer setup I have, and how I might improve it. Thanks!
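As a quick sanity check on the figures quoted in this thread, the arithmetic can be worked through in a few lines. This is only a back-of-the-envelope sketch: the 0.2 sec/step reference is the TITAN X number mentioned in the replies, not a measurement on this machine, and the function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_step(days, steps):
    """Average wall-clock seconds per training step."""
    return days * SECONDS_PER_DAY / steps


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2 ** 30


observed = seconds_per_step(2, 5410)   # 2 days for 5410 steps
reference = 0.2                        # sec/step quoted for a TITAN X

print(round(observed, 1))              # 31.9 sec/step
print(round(observed / reference))     # about 160 times slower
print(round(5410 * reference / 60))    # about 18 minutes at 0.2 sec/step
print(round(bytes_to_gib(17071144960), 2))  # 15.9 GiB requested
```

So the reported 2 days matches the "close to 30 seconds" per step, and the same 5410 steps would take on the order of 18 minutes at the reference speed.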