yangyanli / PointCNN

PointCNN: Convolution On X-Transformed Points (NeurIPS 2018)
https://arxiv.org/abs/1801.07791

Issue with bigger batch size #212

Closed sayakgis closed 4 years ago

sayakgis commented 4 years ago

Hi @burui11087, @yangyanli,

I have been using PointCNN for aerial LiDAR data segmentation. With sample_num set to 12288 and a batch size of 4, training ran successfully on a 16 GB V100 card under TensorFlow 1.10.1 and CUDA 9.2, and I was able to run tf_compile then.

Now that I have a single 32 GB V100 card, I am trying to fit a bigger batch size (12) with the same sample_num, and I get the error below. Could you please suggest what I am missing?

However, when I pass a batch size of 4, training starts fine with the same setup.

My versions: CUDA toolkit 9.2, TensorFlow 1.10.1, NVIDIA driver 415.27.
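For context, the batch dimension of the failing BatchMatMul in the log below appears to be batch size × points per sample at that layer; here is a quick back-of-the-envelope check under that assumption (the layout is my guess from the error message, not confirmed from the code):

```python
# Rough arithmetic for the batched GEMM reported in the error below
# (assumption: PointCNN flattens batch and points into one GEMM batch dimension).
sample_num = 12288   # points per sample at xconv_1
K = 12               # neighbors per representative point (the k=12 in the GEMM)

for batch_size in (4, 12):
    gemm_batch = batch_size * sample_num
    print(batch_size, gemm_batch)
# batch 4  ->  49152 batched 12x12 * 12x96 matmuls (this case trains fine)
# batch 12 -> 147456, matching batch_size=147456 in the error message
```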

Log of the error:

E tensorflow/stream_executor/cuda/cuda_blas.cc:647] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_NOT_SUPPORTED
2020-03-09 10:35:51.149051: E tensorflow/stream_executor/cuda/cuda_blas.cc:2510] Internal: failed BLAS call, see log for details
2020-03-09 10:35:51.149173: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f657d7f89d0
2020-03-09 10:35:51.149236: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f657d7f89f0
2020-03-09 10:35:51.153013: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f655dff99d0
2020-03-09 10:35:51.153059: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f655dff99f0
2020-03-09 10:35:51.153214: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f657dff99d0
2020-03-09 10:35:51.153262: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f657dff99f0
2020-03-09 10:35:51.153464: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f655f7fc9d0
2020-03-09 10:35:51.153511: I tensorflow/stream_executor/stream.cc:4818] stream 0x5648deb70150 did not memzero GPU location; source: 0x7f655f7fc9f0

Traceback (most recent call last):
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[147456,12,12], b.shape=[147456,12,96], m=12, n=96, k=12, batch_size=147456
  [[Node: xconv_1_fts_X = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xconv_1_X_2_KK, xconv_1_nn_fts_input-0-1-TransposeNCHWToNHWC-LayoutOptimizer)]]
  [[Node: metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch_3/_485 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_44108...t/Switch_3", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 311, in
    is_training: True,
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[147456,12,12], b.shape=[147456,12,96], m=12, n=96, k=12, batch_size=147456
  [[Node: xconv_1_fts_X = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xconv_1_X_2_KK, xconv_1_nn_fts_input-0-1-TransposeNCHWToNHWC-LayoutOptimizer)]]
  [[Node: metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch_3/_485 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_44108...t/Switch_3", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'xconv_1_fts_X', defined at:
  File "train.py", line 132, in
    net = model.Net(points_augmented, features_augmented, is_training, setting)
  File "/home/sayak_cowi/notebooks/PointCNN/pointcnn_seg.py", line 11, in __init__
    PointCNN.__init__(self, points, features, is_training, setting)
  File "/home/sayak_cowi/notebooks/PointCNN/pointcnn.py", line 116, in __init__
    depth_multiplier, sorting_method, with_global)
  File "/home/sayak_cowi/notebooks/PointCNN/pointcnn.py", line 39, in xconv
    fts_X = tf.matmul(X_2_KK, nn_fts_input, name=tag + 'fts_X')
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1980, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1236, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/sayak_cowi/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[147456,12,12], b.shape=[147456,12,96], m=12, n=96, k=12, batch_size=147456
  [[Node: xconv_1_fts_X = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xconv_1_X_2_KK, xconv_1_nn_fts_input-0-1-TransposeNCHWToNHWC-LayoutOptimizer)]]
  [[Node: metrics/accuracy/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch_3/_485 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_44108...t/Switch_3", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
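To help isolate whether this is PointCNN-specific or just cuBLAS/TensorFlow at these shapes, here is a minimal standalone sketch (not from the repo; it simply replays the same tf.matmul shapes that xconv_1_fts_X reports, using the TF 1.x API from this setup):

```python
# Minimal repro sketch: run a BatchMatMul with the exact shapes from the error,
# outside of PointCNN, to see whether cublasGemmBatchedEx alone rejects it.
import numpy as np
import tensorflow as tf  # written against TF 1.x (1.10.1 in this issue)

a = tf.placeholder(tf.float32, shape=[147456, 12, 12])   # corresponds to xconv_1_X_2_KK
b = tf.placeholder(tf.float32, shape=[147456, 12, 96])   # corresponds to xconv_1_nn_fts_input
c = tf.matmul(a, b)  # same op as pointcnn.py line 39: fts_X = tf.matmul(X_2_KK, nn_fts_input, ...)

with tf.Session() as sess:
    out = sess.run(c, feed_dict={
        a: np.random.rand(147456, 12, 12).astype(np.float32),
        b: np.random.rand(147456, 12, 96).astype(np.float32),
    })
    print(out.shape)  # (147456, 12, 96) if the cuBLAS call succeeds
```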

sayakgis commented 4 years ago

I could resolve the issue by downgrading to CUDA 9.2 and cuDNN 7.6.5.
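For anyone hitting the same thing, a quick way to double-check which build you are actually running after swapping CUDA/cuDNN (plain TF 1.x calls, nothing PointCNN-specific):

```python
# Sanity-check the runtime after changing CUDA/cuDNN versions.
import tensorflow as tf

print(tf.__version__)                # expecting 1.10.1 in this setup
print(tf.test.is_built_with_cuda())  # True if this wheel was compiled with CUDA support
print(tf.test.gpu_device_name())     # e.g. '/device:GPU:0' when the V100 is visible
```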