When I add --n_gpu=2 to the .sh file, my program raises the error below:
Traceback (most recent call last):
File "main_train.py", line 135, in <module>
main()
File "main_train.py", line 131, in main
trainer.train()
File "/mnt/webdisk//R2GenCMN-main/modules/trainer.py", line 58, in train
result = self._train_epoch(epoch)
File "/mnt/webdisk//R2GenCMN-main/modules/trainer.py", line 185, in _train_epoch
output = self.model(images, reports_ids, mode='train')
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/webdisk//R2GenCMN-main/models/models.py", line 27, in forward_iu_xray
att_feats_0, fc_feats_0 = self.visual_extractor(images[:, 0])
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/webdisk//R2GenCMN-main/modules/visual_extractor.py", line 17, in forward
patch_feats = self.model(images)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ubuntu-4/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)
Have you ever run into the same error, or can you explain how you train with multiple GPUs on your machine?
Thanks!