Closed: Kamalnl92 closed this issue 2 years ago.
Hi Kamalnl92! It seems that something was not moved to the GPU. Could you share more details about this error? Does it occur during the first 200 iterations?
Hello xukechun, I am training the stages again in case I did something wrong; I will get back to you as soon as possible.
I was able to reproduce the error at Training iteration: 200 (episode 122 begins). The iteration appears to finish, and at the end I got Training loss: 0.006129, Push successful: True,
but then I get the RuntimeError below:
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:208: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
nn.init.kaiming_normal(m[1].weight.data)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:4066: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
"Default grid_sample and affine_grid behavior has changed "
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:241: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad():
instead.
rotate_color = F.grid_sample(Variable(input_color_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:4004: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
"Default grid_sample and affine_grid behavior has changed "
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:242: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad():
instead.
rotate_depth = F.grid_sample(Variable(input_depth_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py:243: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad():
instead.
rotate_mask = F.grid_sample(Variable(goal_mask_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:3635: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode)
/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/main.py:595: UserWarning: Input image is entirely zero, no valid convex hull. Returning empty image
prev_mask_hull = binary_dilation(convex_hull_image(prev_goal_mask_heightmap), iterations=5)
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/main.py", line 377, in process_actions
latest_push_predictions, latest_grasp_predictions, latest_state_feat = trainer.goal_forward(latest_color_heightmap, latest_valid_depth_heightmap, latest_goal_mask_heightmap, is_volatile=True)
File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/trainer.py", line 335, in goal_forward
output_prob, state_feat= self.model.forward(input_color_data.cuda(), input_depth_data.cuda(), input_goal_mask_data.cuda(), is_volatile, specific_rotation)
File "/home/s3675319/Xu/Efficient_goal-oriented_push-grasping_synergy/models.py", line 252, in forward
interm_push_color_feat = self.push_color_trunk.features(rotate_color)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
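For reference, this error can be reproduced with a minimal standalone sketch (using a plain Conv2d as a stand-in, not the repo's model): it occurs whenever the input tensor is on the GPU while the module's weights stay on the CPU.

```python
import torch
import torch.nn as nn

# Stand-in module: its weights start on the CPU.
conv = nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 32, 32)

if torch.cuda.is_available():
    try:
        conv(x.cuda())                  # GPU input, CPU weights: device mismatch
    except RuntimeError as e:
        print(e)                        # same "Input type ... and weight type ..." error
    out = conv.cuda()(x.cuda())         # works once the module is moved too
else:
    out = conv(x)                       # all-CPU also works
print(out.shape)                        # torch.Size([1, 8, 30, 30])
```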
Hi Kamalnl92,
It seems that the model failed to be moved from CPU to GPU. The relevant code might be lines 27-36 in trainer.py, which detect the CUDA device:
https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/trainer.py#L27-L36
https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/trainer.py#L115-L116
So I think you can try printing the output of torch.cuda.is_available(). If the problem remains, please let me know.
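A minimal check along these lines (assuming a standard PyTorch install) might be:

```python
import torch

# Sanity check: is CUDA visible to PyTorch at all?
print(torch.cuda.is_available())        # should print True on a working GPU setup
print(torch.version.cuda)               # CUDA version PyTorch was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```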
Hello Xukechun,
I have printed the CUDA availability and the model device in trainer.py, as you suggested. The result:
self.use_cuda before conversion: False
self.use_cuda after conversion: True
torch.cuda.is_available(): True
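The use_cuda flags only record what the code intends to do; to see where the weights actually live, one can inspect a parameter's device directly. A sketch with a stand-in nn.Linear (not the repo's model):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                     # stand-in for the repo's model
print(next(model.parameters()).device)      # cpu

if torch.cuda.is_available():
    model = model.cuda()
    print(next(model.parameters()).device)  # cuda:0
```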
In addition, I also printed the forward pass in models.py; see the images below:
The terminal output can be found below:
pythonOut1.txt slurm-22533373_1.txt
I am using CUDA version 11.2, but I doubt that has anything to do with this problem. Looking forward to your suggestions.
Hello Kamalnl92, so sorry to hear that. I haven't met this problem before, but our code has been tested on CUDA 11.0 with an RTX 3090. Do you mean you can train stage one but it fails in stage two? The code is very similar.
Hello xukechun,
I changed the CUDA version to 11.0; that was not the issue.
I was able to solve the issue by changing the code at the line below in logger.py:
https://github.com/xukechun/Efficient_goal-oriented_push-grasping_synergy/blob/1808adeeb7c8d6f87cb6604bff964ef887abe055/logger.py#L103
It had to be changed to:

def save_model(self, iteration, model, name):
    torch.save(model.cuda().state_dict(), os.path.join(self.models_directory, 'snapshot-%06d.%s.pth' % (iteration, name)))

def save_backup_model(self, model, name):
    torch.save(model.cuda().state_dict(), os.path.join(self.models_directory, 'snapshot-backup.%s.pth' % (name)))
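An alternative to calling model.cuda() at save time is to keep the checkpoint device-agnostic and resolve the device at load time with map_location. A sketch with a stand-in model (the nn.Linear and the file name are hypothetical, not the repo's code):

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for the repo's model
path = os.path.join(tempfile.gettempdir(), "snapshot-demo.pth")
torch.save(model.state_dict(), path)           # save without forcing a device

device = "cuda" if torch.cuda.is_available() else "cpu"
state = torch.load(path, map_location=device)  # remap tensors at load time
model.load_state_dict(state)
model.to(device)                               # keep the live model in sync
```

This way the same snapshot can be loaded on a GPU or a CPU-only machine without the saved tensors pinning the model to a device.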
I will show you the other problems I have with this code in the new issue "Not working repo #4"
Kamal
While training the push model (stage 2) I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument bias in method wrapper_cudnn_batch_norm)
This RuntimeError usually happens at some iteration between 200 and 2000 during training.