yourh / AttentionXML

Implementation for "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification"

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #31

Closed celsofranssa closed 1 year ago

celsofranssa commented 1 year ago

Even with the required environment already set up, I am getting multiple errors like this one:

[I 230221 09:56:33 main:37] Model Name: AttentionXML
[I 230221 09:56:33 main:40] Loading Training and Validation Set
[I 230221 09:56:33 main:52] Number of Labels: 29801
[I 230221 09:56:33 main:53] Size of Training Set: 14748
[I 230221 09:56:33 main:54] Size of Validation Set: 200
[I 230221 09:56:33 main:56] Training
Traceback (most recent call last):
  File "main.py", line 95, in <module>
    main()
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 64, in main
    model.train(train_loader, valid_loader, **model_cnf['train'])
  File "/home/celso/projects/AttentionXML/deepxml/models.py", line 67, in train
    loss = self.train_step(train_x, train_y.cuda())
  File "/home/celso/projects/AttentionXML/deepxml/models.py", line 42, in train_step
    scores = self.model(train_x)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/celso/projects/AttentionXML/deepxml/networks.py", line 42, in forward
    rnn_out = self.lstm(emb_out, lengths)   # N, L, hidden_size * 2
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/celso/projects/AttentionXML/deepxml/modules.py", line 60, in forward
    self.lstm(packed_inputs, (hidden_init, cell_init))[0], batch_first=True)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/celso/projects/venvs/AttentionXML/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 561, in forward
    result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Are you sure that all tensor operations happen on the same device?
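
One way to check this directly, as a minimal sketch: group the model's parameters and the input tensors by their `.device` attribute and see whether more than one device shows up. The helper names `group_by_device` and `on_single_device` are hypothetical and not part of the AttentionXML code; the functions only assume each object exposes a `.device` attribute, as torch tensors do.

```python
from collections import defaultdict


def group_by_device(named_tensors):
    """Group names by the string form of each object's .device attribute.

    named_tensors: iterable of (name, obj) pairs, where obj has a .device
    attribute (torch tensors and parameters do).
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


def on_single_device(named_tensors):
    """Return True when all objects report the same device (or there are none)."""
    return len(group_by_device(named_tensors)) <= 1
```

With torch this could be called as `group_by_device(model.named_parameters())`, since `named_parameters()` yields `(name, parameter)` pairs. Note that the traceback goes through `DataParallel`, which scatters inputs across GPUs, so a check like this is most informative on the wrapped module and a single batch.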

yourh commented 1 year ago

Yes, I'm sure that all tensor operations happen on the same device. My guess is that you're running out of GPU memory.
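
To test the out-of-memory guess, a rough sketch like the following can report how much GPU memory PyTorch has allocated compared with the device total. The function name `gpu_memory_report` is hypothetical and not part of this repository; the guarded import only makes the helper return `None` on machines without torch or CUDA.

```python
try:
    import torch
except ImportError:  # helper degrades to returning None where torch is absent
    torch = None


def gpu_memory_report(device_index=0):
    """Return memory figures in MiB for one GPU, or None if CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "total_mib": total / mib,
        # Memory currently occupied by tensors on this device.
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Highest tensor memory use seen so far in this process.
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device_index) / mib,
    }


if __name__ == "__main__":
    print(gpu_memory_report())
```

Calling it just before the failing training step shows how close the process is to the limit. If the peak is near the total, a smaller batch size is the usual thing to try; if it is far below, memory is probably not the cause.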

celsofranssa commented 1 year ago

Thank you.

celsofranssa commented 1 year ago

I keep getting this error, mainly at this line:

self.lstm(packed_inputs, (hidden_init, cell_init))[0], batch_first=True)

Could you help me?

celsofranssa commented 1 year ago

I've just updated torch to the current release and did some refactoring.

hieudx149 commented 5 months ago

Hello @celsofranssa, which torch version did you update to?

celsofranssa commented 5 months ago

torch==1.13.1
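
For anyone comparing their own environment against that version, here is a minimal sketch using only the standard library (Python 3.8+). The helper names `version_tuple` and `installed_at_least` are hypothetical. The parsing is deliberately simple: it ignores local build suffixes such as `+cu117` and keeps only the first three numeric components.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a string like '1.13.1+cu117' into (1, 13, 1)."""
    public = version.split("+")[0]  # drop a local build suffix, if any
    parts = [int(n) for n in re.findall(r"\d+", public)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '1.13' to (1, 13, 0)


def installed_at_least(package, minimum):
    """Return True/False for an installed package, or None if it is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    print(installed_at_least("torch", "1.13.1"))
```

Comparing tuples rather than raw strings matters here: as plain strings, "1.4.0" sorts after "1.13.1", which would give the wrong answer.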