xukechun / Efficient_goal-oriented_push-grasping_synergy

[RAL & IROS 2021] Efficient learning of goal-oriented push-grasping synergy in clutter

RuntimeError: Expected all tensors to be on the same device #3

Closed Kamalnl92 closed 2 years ago

Kamalnl92 commented 2 years ago

While training the push model (Stage 2) I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument bias in method wrapper_cudnn_batch_norm)

This RuntimeError usually occurs at some point during training between iterations 200 and 2000.
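For context (my addition, not from the thread): this error generally means the model's weights and the input tensors live on different devices. A minimal, CPU-safe diagnostic sketch for pinning down which part disagrees (the helper name `report_devices` is hypothetical):

```python
import torch
import torch.nn as nn

def report_devices(model, *tensors):
    """List the devices of a model's parameters/buffers and of input tensors.

    Every entry returned here must agree before a forward pass, otherwise
    PyTorch raises 'Expected all tensors to be on the same device'.
    """
    devices = {str(p.device) for p in model.parameters()}
    devices |= {str(b.device) for b in model.buffers()}
    for t in tensors:
        devices.add(str(t.device))
    return sorted(devices)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
x = torch.randn(1, 3, 16, 16)
print(report_devices(model, x))  # ['cpu'] -> everything agrees
```

Calling this right before the failing forward pass narrows the mismatch down to either the model or one of the inputs.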

xukechun commented 2 years ago

Hi Kamalnl92! It seems that something was forgotten when moving tensors to the GPU. I would like more details about this error. Does it occur during the first 200 iterations?

Kamalnl92 commented 2 years ago

Hello xukechun, I am training the stages again in case I did something wrong, I will get back to you as soon as possible.

Kamalnl92 commented 2 years ago

I was able to reproduce the error again at training iteration 200, where episode 122 begins. I believe that iteration finished; at the end I got: Training loss: 0.006129, Push successful: True

but then I get the RuntimeError below:

```
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:208: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
  nn.init.kaiming_normal(m[1].weight.data)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:4066: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:241: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  rotate_color = F.grid_sample(Variable(input_color_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:4004: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:242: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  rotate_depth = F.grid_sample(Variable(input_depth_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:243: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  rotate_mask = F.grid_sample(Variable(goal_mask_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:3635: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/main.py:595: UserWarning: Input image is entirely zero, no valid convex hull. Returning empty image
  prev_mask_hull = binary_dilation(convex_hull_image(prev_goal_mask_heightmap), iterations=5)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/main.py", line 377, in process_actions
    latest_push_predictions, latest_grasp_predictions, latest_state_feat = trainer.goal_forward(latest_color_heightmap, latest_valid_depth_heightmap, latest_goal_mask_heightmap, is_volatile=True)
  File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/trainer.py", line 335, in goal_forward
    output_prob, state_feat = self.model.forward(input_color_data.cuda(), input_depth_data.cuda(), input_goal_mask_data.cuda(), is_volatile, specific_rotation)
  File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py", line 252, in forward
    interm_push_color_feat = self.push_color_trunk.features(rotate_color)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```

xukechun commented 2 years ago

Hi Kamalnl92, It seems that the model failed to be moved from the CPU to the GPU. The relevant code might be lines 27-36 in trainer.py, which detect the CUDA device: https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/trainer.py#L27-L36 https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/trainer.py#L115-L116 Could you print the output of torch.cuda.is_available() to check? If the question still remains, please let me know.
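For reference, the device-selection pattern being described looks roughly like this; a sketch of the usual PyTorch idiom, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Pick the device once, then move both the model and every input to it.
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

model = nn.Linear(4, 2).to(device)     # model weights follow the chosen device
x = torch.randn(1, 4, device=device)   # inputs must be created/moved there too
y = model(x)                           # no device-mismatch error
print(use_cuda, tuple(y.shape))
```

If `torch.cuda.is_available()` prints True but the error still occurs, some tensor or module is being created (or moved) after this setup without a matching `.to(device)`/`.cuda()` call.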

Kamalnl92 commented 2 years ago

Hello Xukechun,

I have printed what you suggested in trainer.py (CUDA availability and model device), see below:

Screenshot 2021-12-14 at 17 52 47

The result:

    self.use_cuda before conversion: False
    self.use_cuda after conversion: True
    torch.cuda.is_available(): True

In addition, I also added prints to the forward pass in models.py, see the images below:

Screenshot 2021-12-14 at 17 39 34 Screenshot 2021-12-14 at 17 39 48 Screenshot 2021-12-14 at 17 40 26 Screenshot 2021-12-14 at 17 40 37

The terminal output can be found below:

pythonOut1.txt slurm-22533373_1.txt

I am using CUDA version 11.2, but I doubt that has anything to do with this problem. Looking forward to your suggestions.

xukechun commented 2 years ago

Hello Kamalnl92: Sorry to hear that. I haven't met this problem before, but our code has been tested on CUDA 11.0 with an RTX 3090. Do you mean you can train stage one but fail in stage two? The code for the two stages is very similar.

Kamalnl92 commented 2 years ago

Hello xukechun,

I have changed the CUDA version to 11.0; that was not the issue.

I was able to solve this issue by changing the code line below https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/logger.py#L103

This had to be changed in logger.py to:

    def save_model(self, iteration, model, name):
        torch.save(model.cuda().state_dict(), os.path.join(self.models_directory, 'snapshot-%06d.%s.pth' % (iteration, name)))

    def save_backup_model(self, model, name):
        torch.save(model.cuda().state_dict(), os.path.join(self.models_directory, 'snapshot-backup.%s.pth' % (name)))
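A possible explanation (my reading, not confirmed in the thread): if the original save path moved the model to the CPU in place (e.g. via `model.cpu()`) without moving it back, every forward pass after a snapshot would mix CUDA inputs with CPU weights, which matches the traceback. A device-neutral alternative is to copy the state_dict to CPU without touching the live model; a sketch, with the hypothetical helper name `save_model_safely`:

```python
import os
import torch
import torch.nn as nn

def save_model_safely(model, path):
    # Copy each parameter/buffer tensor to CPU instead of calling model.cpu(),
    # which would move the live model in place and break later GPU forwards.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, path)

model = nn.Linear(4, 2)
save_model_safely(model, 'snapshot-demo.pth')
reloaded = torch.load('snapshot-demo.pth')
print(sorted(reloaded.keys()))  # ['bias', 'weight']
os.remove('snapshot-demo.pth')
```

This saves a CPU-only checkpoint (loadable on machines without a GPU) while leaving the training model on whatever device it was on.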

I will describe the other problems I have with this code in the new issue "Not working repo #4".

Kamal

xukechun commented 2 years ago

Happy to hear that. We just followed the logger.py from the VPG repo.