xiangwang1223 / neural_graph_collaborative_filtering

Neural Graph Collaborative Filtering, SIGIR2019
MIT License

InternalError #57

Open qianlong0502 opened 1 year ago

qianlong0502 commented 1 year ago

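Has anyone run into this error? I have tried running the code on several different hosts and GPUs, but I get the same error every time. Training runs for several epochs, and then the first evaluation step (the call to test() in NGCF.py) fails with "Blas GEMM launch failed".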

python NGCF.py --dataset gowalla --regs [1e-5] --embed_size 64 --layer_size [64,64,64] --lr 0.0001 --save_flag 1 --pretrain 0 --batch_size 1024 --epoch 400 --verbose 1 --node_dropout [0.1] --mess_dropout [0.1,0.1,0.1]
n_users=29858, n_items=40981
n_interactions=1027370
n_train=810128, n_test=217242, sparsity=0.00084
already load adj matrix (70839, 70839) 0.16373920440673828
use the normalized adjacency matrix
using xavier initialization
2023-01-11 11:21:00.770564: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:884] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
without pretraining.
Epoch 0 [102.3s]: train==[549.91321=549.87742 + 0.00000]
Epoch 1 [100.1s]: train==[550.38794=550.35171 + 0.00000]
Epoch 2 [100.2s]: train==[551.38123=551.34350 + 0.00000]
Epoch 3 [100.2s]: train==[552.64801=552.60893 + 0.00000]
Epoch 4 [100.1s]: train==[554.05444=554.01392 + 0.00000]
Epoch 5 [100.2s]: train==[555.21527=555.17249 + 0.00000]
Epoch 6 [100.3s]: train==[557.17529=557.13013 + 0.00000]
Epoch 7 [100.2s]: train==[559.49915=559.45179 + 0.00000]
Epoch 8 [100.1s]: train==[561.67401=561.62488 + 0.00000]
2023-01-11 11:37:45.543238: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NGCF.py", line 490, in <module>
    ret = test(sess, model, users_to_test, drop_flag=True)
  File "/root/our-mm-learning/codes/NGCF2/NGCF/utility/batch_test.py", line 167, in test
    model.mess_dropout: [0.] * len(eval(args.layer_size))})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'MatMul_6', defined at:
  File "NGCF.py", line 360, in <module>
    model = NGCF(data_config=config, pretrain_data=pretrain_data)
  File "NGCF.py", line 101, in __init__
    self.batch_ratings = tf.matmul(self.u_g_embeddings, self.pos_i_g_embeddings, transpose_a=False, transpose_b=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4279, in mat_mul
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(2048, 256), b.shape=(40981, 256), m=2048, n=40981, k=256
         [[Node: MatMul_6 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_lookup, embedding_lookup_1)]]
         [[Node: MatMul_6/_53 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1902_MatMul_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
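
For reference, the shapes in the error message can be sanity-checked with a little arithmetic. The sketch below is plain Python and is not code from the repository. It only sizes the tensors involved in the failing MatMul, assuming float32 storage (the node reports T=DT_FLOAT). The value n=40981 matches n_items in the log above. k=256 would be consistent with the 64-dimensional embedding concatenated with three 64-dimensional layers, though that reading is an assumption on my part. The helper name mib is made up for illustration.

```python
# Shapes copied from the "Blas GEMM launch failed" message above.
m, n, k = 2048, 40981, 256
BYTES_PER_FLOAT32 = 4  # assumption: float32 storage, as T=DT_FLOAT suggests


def mib(num_elements):
    """Size in MiB of a float32 tensor with the given number of elements."""
    return num_elements * BYTES_PER_FLOAT32 / 2 ** 20


a_elements = m * k    # first operand, shape (2048, 256)
b_elements = n * k    # second operand, shape (40981, 256)
out_elements = m * n  # result of a @ b.T, shape (2048, 40981)

for name, count in [("a", a_elements), ("b", b_elements), ("output", out_elements)]:
    print("{}: {} elements, {:.2f} MiB".format(name, count, mib(count)))
```

This only gives the size of each tensor (the output comes to roughly 320 MiB). It does not by itself say why the cuBLAS call failed.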