yusanshi / news-recommendation

Implementations of some methods in news recommendation.
MIT License

Multi-GPUs #20

Open uuid123 opened 3 years ago

uuid123 commented 3 years ago

Hello, Mr. Yu. I have a problem when I try to train the model on multiple GPUs.

device_ids = [0, 1]
model = torch.nn.parallel.DataParallel(model, device_ids=device_ids)
model.to(device)

When I use the above code, I get the following error:

Traceback (most recent call last):
  File "/home/wxx/progressfiles/project_pc_NLP/NewsRecommendation-master2/src/train.py", line 318, in <module>
    train()
  File "/home/wxx/progressfiles/project_pc_NLP/NewsRecommendation-master2/src/train.py", line 227, in train
    minibatch["clicked_news"])
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wxx/progressfiles/project_pc_NLP/NewsRecommendation-master2/src/model/DKN/__init__.py", line 48, in forward
    [self.kcnn(x) for x in candidate_news], dim=1)
  File "/home/wxx/progressfiles/project_pc_NLP/NewsRecommendation-master2/src/model/DKN/__init__.py", line 48, in <listcomp>
    [self.kcnn(x) for x in candidate_news], dim=1)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wxx/progressfiles/project_pc_NLP/NewsRecommendation-master2/src/model/DKN/KCNN.py", line 69, in forward
    word_vector = self.word_embedding(news["title"].to(device))
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/wxx/progressfiles/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

Process finished with exit code 1
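
From the traceback, the failure appears to happen inside KCNN.forward: news["title"].to(device) moves the word indices to a fixed global device (typically cuda:0), while DataParallel has already placed this replica's embedding weights on cuda:1, so the embedding lookup sees indices and weights on different GPUs. Below is a minimal, device-agnostic sketch of that forward step; the class is a simplified stand-in for src/model/DKN/KCNN.py, not the repository's actual code.

import torch
import torch.nn as nn

class KCNN(nn.Module):
    # Simplified stand-in for the real KCNN module.
    def __init__(self, num_words, word_dim):
        super().__init__()
        self.word_embedding = nn.Embedding(num_words, word_dim)

    def forward(self, news):
        # Use the device this replica's parameters live on, instead of a
        # global `device`, so each DataParallel replica stays on its own GPU.
        device = self.word_embedding.weight.device
        word_vector = self.word_embedding(news["title"].to(device))
        return word_vector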
yusanshi commented 3 years ago

Sorry, but I have no experience with multi-GPU training, so I'm afraid I'm not able to help you :)

BTW, should it be torch.nn.DataParallel? I searched torch.nn.parallel.DataParallel but didn't find it.

uuid123 commented 3 years ago

But the effect is the same. And it looks like this is where the error comes from:

y_pred = model(minibatch["candidate_news"],
               minibatch["clicked_news"])