Closed. Levaru closed this issue 2 years ago.
I think I figured out what the problem is after upgrading to tensorflow-gpu==1.15. It looks like I'm running out of memory:
Resource exhausted: OOM when allocating tensor with shape[4,4096,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
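For context, a rough back-of-the-envelope on that allocation (assuming "type float" in the message means float32, i.e. 4 bytes per element):

# Size of the tensor from the OOM message, assuming float32 (4 bytes per element).
elements = 4 * 4096 * 4096      # shape [4, 4096, 4096]
size_bytes = elements * 4
print(size_bytes / 1024**3)     # = 0.25, i.e. ~256 MB for this one tensor

So this single tensor is only about 256 MB; the OOM presumably means the rest of the graph had already claimed most of the card's memory.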
Is this normal? Was your model trained on a single graphics card or on a cluster?
Update: After reducing the batch size I get the same error again. I tried to find a solution online, but there is almost no information on this error, and the one proposed solution I did find was already present in the code:
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab GPU memory on demand instead of all at once
config.allow_soft_placement = True      # fall back to another device if an op has no GPU kernel
sess = tf.Session(config=config)
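For completeness, the other knob I came across is capping the process at a fixed share of GPU memory instead of letting it grow. This is just a sketch of the standard TF1.x option (the 0.8 is an arbitrary example value, not a recommendation from this repo):

import tensorflow as tf

# Alternative to allow_growth: claim at most a fixed fraction of GPU memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
sess = tf.Session(config=config)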
I trained my model on a single GPU: not an RTX3080, but an RTX2080. I'm not quite sure what the problem is. Were you able to train the network with the dataset I provided, or did you try to run fpcc_test.py?
I was finally able to start and complete the training with an RTX3070. I'm guessing the issue was some kind of compatibility problem between the TensorFlow 1.x version and the newer RTX cards, but I'm not sure.
I solved this by installing the Tensorflow version maintained by Nvidia. You can follow this guide if you want to do this with a conda environment or just use the following commands like I did:
pip install nvidia-pyindex
pip install nvidia-tensorflow[horovod]
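After installing, a quick sanity check should confirm the Nvidia build reports 1.15.x and actually sees the card (tf.test.is_gpu_available() and tf.test.gpu_device_name() are the stock TF1.x calls; this snippet is a suggested check, not part of the repo):

import tensorflow as tf

print(tf.__version__)              # should print 1.15.x
print(tf.test.is_gpu_available())  # True if a usable CUDA GPU is found
print(tf.test.gpu_device_name())   # e.g. '/device:GPU:0'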
Thank you very much for the suggestion. TF1.x doesn't seem to have the flexibility to split the data across multiple GPUs automatically; from what I can tell, that has to be wired up by hand, as in the sketch below.
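Roughly what that manual wiring looks like in TF1.x (a minimal sketch; model_fn and the two-GPU count are hypothetical placeholders, not code from this repo):

import tensorflow as tf

def build_towers(model_fn, batch, num_gpus=2):
    # Manual TF1.x data parallelism: split the batch and build one tower per GPU.
    splits = tf.split(batch, num_gpus)
    tower_losses = []
    for i in range(num_gpus):
        # Reuse the same variable scope so all towers share one set of weights.
        with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
            tower_losses.append(model_fn(splits[i]))
    return tf.reduce_mean(tower_losses)  # average the per-tower losses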
I managed to start the training but it was running on the CPU because tensorflow-gpu was missing. After installing tensorflow-gpu==1.13.1 and CUDA 10.0 along with the corresponding cuDNN 7.6.4, the training fails with the following error message:

Am I using the correct dev environment? Your paper says that you used a GTX1080, but I have an RTX3070. Is my card not compatible with the older TensorFlow or CUDA versions, or did I set up the wrong environment?
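In case it helps anyone with the same CPU-only symptom: listing the devices TensorFlow can see shows immediately whether the GPU was picked up at all (device_lib is the usual TF1.x way to do this; the snippet is a generic check, not repo code):

from tensorflow.python.client import device_lib

# If only a CPU entry appears, TF cannot see the GPU (wrong package or CUDA mismatch).
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)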