yourh / AttentionXML

Implementation for "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification"

Cuda Error : RuntimeError: CUDNN_STATUS_EXECUTION_FAILED #22

Closed · HMM2021 closed this issue 3 years ago

HMM2021 commented 3 years ago

Hi, I got an error, `RuntimeError: CUDNN_STATUS_EXECUTION_FAILED`, when trying to train level 1. It is raised by `loss = self.train_step(train_x, train_y.cuda())` (line 70 of `models.py`). When I change this line to `loss = self.train_step(train_x.cuda(), train_y.cuda())`, I still get other issues!
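
For what it's worth, here is a minimal sketch of how the batch could be moved to the GPU whether `train_x` is a single tensor or a list of tensors (the `to_cuda` helper below is hypothetical, not part of AttentionXML):

```python
import torch

def to_cuda(batch):
    # Move a tensor, or every tensor inside a list/tuple, onto the GPU;
    # anything that is not a tensor (e.g. lengths) is returned unchanged.
    if torch.is_tensor(batch):
        return batch.cuda(non_blocking=True)
    if isinstance(batch, (list, tuple)):
        return [to_cuda(item) for item in batch]
    return batch

# Around line 70 of models.py, the call could then become:
# loss = self.train_step(to_cuda(train_x), train_y.cuda())
```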

yourh commented 3 years ago

From your description, I guess there are some problems with your environment. Please give me the detailed error information.

HMM2021 commented 3 years ago

I retried training level 2 and got this error:

```
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
      2 start_time = time.time()
      3 model = FastAttentionXML(labels_num, data_cnf, model_cnf, '')
----> 4 model.train(train_x, train_y, valid_x, valid_y, mlb)
      5 finish_time = time.time()
      6 print('Training Finished')

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train(self, train_x, train_y, valid_x, valid_y, mlb)
    204                                   kwargs=self.model_cnf['cluster'])
    205         cluster_process.start()
--> 206         self.train_level(self.level - 1, train_x, train_y, valid_x, valid_y)
    207         cluster_process.join()
    208         cluster_process.close()

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train_level(self, level, train_x, train_y, valid_x, valid_y)
     87             return train_y, model.predict(train_loader, k=self.top), model.predict(valid_loader, k=self.top)
     88         else:
---> 89             train_group_y, train_group, valid_group = self.train_level(level - 1, train_x, train_y, valid_x, valid_y)
     90             torch.cuda.empty_cache()
     91

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train_level(self, level, train_x, train_y, valid_x, valid_y)
    147                         F'Number of Labels: {labels_num}, '
    148                         F'Candidates Number: {train_loader.dataset.candidates_num}')
--> 149             model.train(train_loader, valid_loader, **model_cnf['train'][level])
    150             model.optimizer = model.state = None
    151             logger.info(F'Finish Training Level-{level}')

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train(self, *args, **kwargs)
    170
    171     def train(self, *args, **kwargs):
--> 172         super(XMLModel, self).train(*args, **kwargs)
    173         self.save_model_to_disk()
    174

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train(self, train_loader, valid_loader, opt_params, nb_epoch, step, k, early, verbose, swa_warmup, **kwargs)
     68                 #if type(train_x)!=list:
     69                 #    train_x = train_x.cpu()
---> 70                 loss = self.train_step(train_x, train_y.cuda())  # change train_x to train_x.cuda()
     71                 if global_step % step == 0:
     72                     self.swa_step()

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train_step(self, train_x, train_y)
    156         scores = self.network(train_x, candidates=candidates, attn_weights=self.attn_weights)
    157         loss = self.loss_fn(scores, train_y)
--> 158         loss.backward()
    159         self.clip_gradient()
    160         self.optimizer.step(closure=None)

/usr/local/lib/python3.8/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222
    223     def register_hook(self, hook):

/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    128         retain_graph = create_graph
    129
--> 130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
    132         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
```

I think it's a problem of OOM, since I'm training the model on a dataset of 3,720,000 examples with 300 words on average, on only one GPU.

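One rough way to test the OOM hypothesis is to log GPU memory usage right before the failing step; a genuine out-of-memory condition usually surfaces as `CUDA out of memory` rather than a cuDNN execution failure. A minimal sketch (assuming a single GPU, device 0):

```python
import torch

# Log how much GPU memory the process actually holds; a real OOM would
# normally raise "CUDA out of memory" instead of a cuDNN error.
gib = 2 ** 30
print(f'allocated: {torch.cuda.memory_allocated() / gib:.2f} GiB')
print(f'reserved:  {torch.cuda.memory_reserved() / gib:.2f} GiB')
print(f'capacity:  {torch.cuda.get_device_properties(0).total_memory / gib:.2f} GiB')
```
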
yourh commented 3 years ago

I think it's a problem of your NVIDIA driver or CUDA library, not OOM.
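
To verify that, one quick check (a sketch, independent of this repository) is to print the versions PyTorch sees and run a tiny cuDNN-backed LSTM outside AttentionXML; if this minimal script also fails with a cuDNN error, the environment is at fault:

```python
import torch

# Versions PyTorch was built against; a mismatch with the installed NVIDIA
# driver is a common cause of CUDNN_STATUS_EXECUTION_FAILED.
print('torch :', torch.__version__)
print('CUDA  :', torch.version.cuda)
print('cuDNN :', torch.backends.cudnn.version())
print('device:', torch.cuda.get_device_name(0))

# Tiny cuDNN-backed LSTM forward/backward pass on the GPU.
lstm = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True).cuda()
x = torch.randn(4, 10, 32, device='cuda')
out, _ = lstm(x)
out.sum().backward()
torch.cuda.synchronize()
print('cuDNN LSTM forward/backward OK')
```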

celsofranssa commented 1 year ago

Hi @HMM2021, did you resolve this issue?