zhjohnchan / R2GenCMN

[ACL-2021] The official implementation of Cross-modal Memory Networks for Radiology Report Generation.
Apache License 2.0
77 stars, 7 forks

Runtime Error when training using multi gpu #9

Closed by dikiyul 1 year ago

dikiyul commented 1 year ago

When I add --n_gpu=2 to the .sh file, the program raises the error below:

Traceback (most recent call last):
  File "main_train.py", line 135, in <module>
    main()
  File "main_train.py", line 131, in main
    trainer.train()
  File "/mnt/webdisk//R2GenCMN-main/modules/trainer.py", line 58, in train
    result = self._train_epoch(epoch)
  File "/mnt/webdisk//R2GenCMN-main/modules/trainer.py", line 185, in _train_epoch
    output = self.model(images, reports_ids, mode='train')
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/webdisk//R2GenCMN-main/models/models.py", line 27, in forward_iu_xray
    att_feats_0, fc_feats_0 = self.visual_extractor(images[:, 0])
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/webdisk//R2GenCMN-main/modules/visual_extractor.py", line 17, in forward
    patch_feats = self.model(images)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

Have you run into the same error, or can you explain how to train with multiple GPUs on your machine? Thanks!

zhjohnchan commented 1 year ago

Hi @dikiyul,

Thanks for your attention! This codebase does not support multi-GPU training. We refer you to this up-to-date project here.

All the best, Zhihong
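
For readers who hit the same traceback: the thread does not diagnose the root cause, but one plausible explanation is that the model stores a bound method as an instance attribute (something like `self.forward = self.forward_iu_xray` in `__init__`). That is an assumption, since the repository code is not shown here. If replicas are created by shallow-copying the module's attributes, which is my understanding of how `torch.nn.DataParallel` replicates modules, that attribute would stay bound to the original object on `cuda:0` while the inputs arrive on `cuda:1`. The sketch below reproduces only the Python mechanism, without PyTorch. The class names, `shallow_replica`, and the string "devices" are all hypothetical stand-ins.

```python
import copy


class PinnedForward:
    """Hypothetical model that picks its forward pass in __init__."""

    def __init__(self, device):
        self.device = device
        # Stored in the instance __dict__ as a method bound to *this* object.
        self.forward = self.forward_iu_xray

    def forward_iu_xray(self):
        return self.device


class DispatchedForward:
    """Same idea, but forward stays an ordinary class-level method."""

    def __init__(self, device):
        self.device = device

    def forward(self):
        # Looked up on the class, so `self` is whichever copy is being called.
        return self.forward_iu_xray()

    def forward_iu_xray(self):
        return self.device


def shallow_replica(module, device):
    """Stand-in for replication: shallow-copy the object, then retarget it."""
    replica = copy.copy(module)
    replica.device = device
    return replica


pinned = shallow_replica(PinnedForward("cuda:0"), "cuda:1")
dispatched = shallow_replica(DispatchedForward("cuda:0"), "cuda:1")

print(pinned.forward())      # cuda:0 (still bound to the original object)
print(dispatched.forward())  # cuda:1 (follows the replica)
```

If this is indeed what happens in the codebase, moving the dataset dispatch inside a regular `forward` method, as in `DispatchedForward`, would be one way to avoid the mismatch. This has not been verified against the repository.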