Training on GPU is failing

Levaru commented 2 years ago

I managed to start the training, but it was running on the CPU because tensorflow-gpu was missing. After installing tensorflow-gpu==1.13.1 and CUDA 10.0 along with the corresponding cuDNN 7.6.4, the training fails with the following error message:

Training for the epoch 0/100 ...
2022-02-24 19:27:57.240773: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2022-02-24 19:29:30.267115: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_EXECUTION_FAILED
2022-02-24 19:29:30.267153: E tensorflow/stream_executor/cuda/cuda_blas.cc:2620] Internal: failed BLAS call, see log for details
2022-02-24 19:29:30.267182: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_EXECUTION_FAILED
2022-02-24 19:29:30.267214: E tensorflow/stream_executor/cuda/cuda_blas.cc:2620] Internal: failed BLAS call, see log for details

Traceback (most recent call last):
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[4,4096,50], b.shape=[4,50,4096], m=4096, n=4096, k=50, batch_size=4
     [[{{node MatMul_4}}]]
     [[{{node Mean_2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "fpcc_train.py", line 282, in <module>
    train()
  File "fpcc_train.py", line 271, in train
    train_one_epoch(epoch)
  File "fpcc_train.py", line 236, in train_one_epoch
    _, loss_val,score_loss_val, grouperr_val = sess.run([train_op, loss, score_loss, grouperr], feed_dict=feed_dict)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[4,4096,50], b.shape=[4,50,4096], m=4096, n=4096, k=50, batch_size=4
     [[node MatMul_4 (defined at /home/ceres/git/FPCC/models/model.py:199) ]]
     [[node Mean_2 (defined at /home/ceres/git/FPCC/models/model.py:287) ]]

Caused by op 'MatMul_4', defined at:
  File "fpcc_train.py", line 282, in <module>
    train()
  File "fpcc_train.py", line 129, in train
    loss, score_loss, grouperr = model.get_loss(net_output, labels,vdm, asm, D_MAX, MARGINS)
  File "/home/ceres/git/FPCC/models/model.py", line 199, in get_loss
    group_mat_label = tf.matmul(pts_group_label,tf.transpose(pts_group_label, perm=[0, 2, 1])) #BxNxN: (i,j) = 1 if i and j in the same group
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2417, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1423, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ceres/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[4,4096,50], b.shape=[4,50,4096], m=4096, n=4096, k=50, batch_size=4
     [[node MatMul_4 (defined at /home/ceres/git/FPCC/models/model.py:199) ]]
     [[node Mean_2 (defined at /home/ceres/git/FPCC/models/model.py:287) ]]

Am I using the correct dev environment? Your paper says that you used a GTX1080, but I have an RTX3070. Is my card not compatible with the older TensorFlow or CUDA versions, or did I set up the wrong environment?

Levaru commented 2 years ago

I think I figured out what the problem is after upgrading to tensorflow-gpu==1.15. It looks like I'm running out of memory:

Resource exhausted: OOM when allocating tensor with shape[4,4096,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Is this normal? Was your model trained on a single graphics card or on a cluster?
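For scale, the size of the tensor named in the OOM message can be worked out by hand. The sketch below is only an illustration: the helper name tensor_bytes is made up for this example, and it assumes "type float" means 4-byte float32. The shape [4,4096,4096] is also what the MatMul_4 call in the earlier traceback produces (batch_size=4, m=4096, n=4096).

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element assumes float32)."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape taken from the OOM message; smaller batch sizes shown for comparison.
for batch_size in (4, 2, 1):
    shape = (batch_size, 4096, 4096)
    print(f"{shape}: {tensor_bytes(shape) / 2**20:.0f} MiB")
```

One such tensor is 256 MiB at batch size 4 and 64 MiB at batch size 1. This only sizes a single tensor; how many of them the loss and its gradients keep alive at once is not something the log shows.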

Update: After reducing the batch size, I get the same error again. I searched for a solution online, but there is almost no information on this error, and the proposed solution was already present in the code:

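import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of reserving it all up front
config.allow_soft_placement = True  # fall back to a supported device if the requested one is unavailable
sess = tf.Session(config=config)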

xyjbaal commented 2 years ago

I trained my model on a single GPU. I didn't train it on an RTX3080; I trained it on an RTX2080. I'm not quite sure what the problem is. Were you able to train the network with the dataset I provided, or did you try running fpcc_test.py?

Levaru commented 2 years ago

I was finally able to start and complete the training with an RTX3070. I'm guessing that the issue was some kind of compatibility problem between the TensorFlow 1.x version and the new RTX cards, but I'm not sure.
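One way to sanity-check that guess is to compare each card's CUDA compute capability with the oldest CUDA toolkit that can target it natively. The sketch below hard-codes a small table for the three cards mentioned in this thread; the table and the helper name toolkit_supports are written for this illustration and are not taken from the FPCC code or from TensorFlow. It is a plausibility check, not a diagnosis of the cuBLAS error.

```python
# Oldest CUDA toolkit that can natively target each compute capability.
# Only the cards mentioned in this thread are listed.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1080
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080
    (8, 6): (11, 1),  # Ampere, e.g. RTX 3070
}


def toolkit_supports(cuda_version, capability):
    """Return True if the CUDA toolkit version can natively target the capability."""
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[capability]


cards = [("GTX 1080", (6, 1)), ("RTX 2080", (7, 5)), ("RTX 3070", (8, 6))]
for name, capability in cards:
    # CUDA 10.0 is the toolkit that was paired with tensorflow-gpu 1.13.1 above.
    print(name, toolkit_supports((10, 0), capability))
```

With CUDA 10.0 this prints True for the GTX 1080 and the RTX 2080 and False for the RTX 3070, which is consistent with the training working on the older cards but not on the newer one.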

I solved this by installing the TensorFlow version maintained by NVIDIA. You can follow this guide if you want to do this with a conda environment, or just use the following commands like I did:

pip install nvidia-pyindex
pip install nvidia-tensorflow[horovod]

xyjbaal commented 2 years ago

Thank you very much for the suggestion. TF1.x doesn't seem to have the flexibility to use multiple GPUs based on the data.

xyjbaal / FPCC

Training on GPU is failing #7