Open qianlong0502 opened 1 year ago
Has anyone else run into this error? I am quite confused: I have tried running this code on several different hosts and GPUs, but I get the same error every time.
python NGCF.py --dataset gowalla --regs [1e-5] --embed_size 64 --layer_size [64,64,64] --lr 0.0001 --save_flag 1 --pretrain 0 --batch_size 1024 --epoch 400 --verbose 1 --node_dropout [0.1] --mess_dropout [0.1,0.1,0.1]
n_users=29858, n_items=40981
n_interactions=1027370
n_train=810128, n_test=217242, sparsity=0.00084
already load adj matrix (70839, 70839) 0.16373920440673828
use the normalized adjacency matrix
using xavier initialization
2023-01-11 11:21:00.770564: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:884] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
without pretraining.
Epoch 0 [102.3s]: train==[549.91321=549.87742 + 0.00000]
Epoch 1 [100.1s]: train==[550.38794=550.35171 + 0.00000]
Epoch 2 [100.2s]: train==[551.38123=551.34350 + 0.00000]
Epoch 3 [100.2s]: train==[552.64801=552.60893 + 0.00000]
Epoch 4 [100.1s]: train==[554.05444=554.01392 + 0.00000]
Epoch 5 [100.2s]: train==[555.21527=555.17249 + 0.00000]
Epoch 6 [100.3s]: train==[557.17529=557.13013 + 0.00000]
Epoch 7 [100.2s]: train==[559.49915=559.45179 + 0.00000]
Epoch 8 [100.1s]: train==[561.67401=561.62488 + 0.00000]
2023-01-11 11:37:45.543238: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NGCF.py", line 490, in <module>
    ret = test(sess, model, users_to_test, drop_flag=True)
  File "/root/our-mm-learning/codes/NGCF2/NGCF/utility/batch_test.py", line 167, in test
    model.mess_dropout: [0.] * len(eval(args.layer_size))})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'MatMul_6', defined at:
  File "NGCF.py", line 360, in <module>
    model = NGCF(data_config=config, pretrain_data=pretrain_data)
  File "NGCF.py", line 101, in __init__
    self.batch_ratings = tf.matmul(self.u_g_embeddings, self.pos_i_g_embeddings, transpose_a=False, transpose_b=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4279, in mat_mul
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
  [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
  [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]