zgojcic / Rigid3DSceneFlow

[CVPR 2021, Oral] "Weakly Supervised Learning of Rigid 3D Scene Flow"

Invalid in_feat_size 0 with Cuda 11 #1

Closed zgojcic closed 3 years ago

zgojcic commented 3 years ago

When using CUDA 11, our model returns the following error:

File "/home/zgojcic/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine-0.5.1-py3.7-linux-x86_64.egg/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
RuntimeError: /home/zgojcic/Documents/Rigid3DSceneFlow/MinkowskiEngine/src/convolution_gpu.cu:85, assertion (in_feat.size(0) == p_map_manager->size(in_key)) failed. Invalid in_feat size 0 != 5296

It seems that this is due to the combination of CUDA 11 with MinkowskiEngine. The issue is currently under investigation: https://github.com/NVIDIA/MinkowskiEngine/issues/330

Until it is solved, we suggest using CUDA 10.2 or 10.1.

zmlshiwo commented 3 years ago

Hi, Zan. Nice work! I want to run your code. My GPU is an RTX 3090, so it only supports CUDA 11. Has this problem been solved? Thank you.

zgojcic commented 3 years ago

Hi, yeah, this is how we actually first saw the problem (with a 3090). I think that it is not solved yet, but Chris is usually very fast with these things, so it should be quick. I will update this issue once it is solved.

zmlshiwo commented 3 years ago

OK, great, thank you very much.

zmlshiwo commented 3 years ago

Hi, Zan. I have tested the code on my computer with Ubuntu 20.04, CUDA 11.1, an RTX 3090 GPU, MinkowskiEngine 0.5.2, PyTorch 1.8, and Python 3.7. When I run the training code, it shows the following error:

python train.py ./configs/train/train_fully_supervised.yaml
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine/__init__.py:42: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
  "It is recommended to set it below 24.",
Using /home/ps/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ps/.cache/torch_extensions/cd/build.ninja...
Building extension module cd...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cd...
2021-03-29 10:21:52 zml root[103079] INFO Command: train.py ./configs/train/train_fully_supervised.yaml
2021-03-29 10:21:52 zml root[103079] INFO Arguments: method_backbone: ME, method_flow: True, method_ego_motion: False, method_semantic: False, method_clustering: False, misc_voxel_size: 0.1, misc_num_points: 8192, misc_trainer: FlowTrainer, misc_use_gpu: True, misc_log_dir: ./logs/, misc_run_mode: train, data_input_features: absolute_coords, data_only_near_points: True, data_dataset: FlyingThings3D_ME, data_root: /media/ps/data/rigid_scene_flow_dataset/flying_things_3d/, data_remove_ground: False, data_augment_data: False, train_batch_size: 8, train_acc_iter_size: 1, train_num_workers: 6, train_max_epoch: 50, train_stat_interval: 5, train_chkpt_interval: 40, train_val_interval: 20, train_weighted_seg_loss: True, val_batch_size: 8, val_num_workers: 6, test_results_dir: ./eval/, test_batch_size: 1, test_num_workers: 1, loss_bg_loss_w: 1.0, loss_fg_loss_w: 1.0, loss_flow_loss_w: 1.0, loss_ego_loss_w: 1.0, loss_inlier_loss_w: 0.005, loss_cd_loss_w: 0.5, loss_rigid_loss_w: 1.0, loss_background_loss: False, loss_flow_loss: True, loss_ego_loss: False, loss_foreground_loss: False, optimizer_alg: Adam, optimizer_learning_rate: 0.001, optimizer_weight_decay: 0.0, optimizer_momentum: 0.8, optimizer_scheduler: ExponentialLR, optimizer_exp_gamma: 0.98, network_normalize_features: True, network_norm_type: IN, network_in_kernel_size: 7, network_feature_dim: 64, network_use_pretrained: True, network_pretrained_path: , metrics_flow: True, metrics_ego_motion: False, metrics_semantic: False
2021-03-29 10:21:52 zml root[103079] INFO Output and logs will be saved to ./logs/logs_FlyingThings3D_ME/21_03_29-10_21_52_431116Method_ME_FlowVoxSize_0.1__Pts_8192
2021-03-29 10:21:52 zml root[103079] INFO Parameter Count: 8073729
2021-03-29 10:21:52 zml root[103079] INFO Torch version: 1.8.0
2021-03-29 10:21:52 zml root[103079] INFO CUDA version: 11.1
2021-03-29 10:21:52 zml root[103079] INFO Training epoch: 0, LR: [0.001]
  0%| | 0/1963 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 243, in <module>
    main(cfg, args.config)
  File "train.py", line 136, in main
    losses, metrics, total_loss = trainer.train_step(batch)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 45, in train_step
    losses, metrics = self._compute_loss_metrics(data)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 152, in _compute_loss_metrics
    inferred_values = self.model(sinput1, sinput2, input_dict['pcd_eval_s'], input_dict['pcd_eval_t'], input_dict['fg_labels_s'], input_dict['fg_labels_t'])
  File "/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 308, in forward
    self._infer_flow(dec_feat_1, dec_feat_2)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 102, in _infer_flow
    feat_s = flow_f_1.F[flow_f_1.C[:,0] == b_idx]
RuntimeError: invalid shape dimension -1122459258
  0%| | 0/1963 [00:02<?, ?it/s]

As shown above, it reports an invalid shape dimension. I also suspect that this is a MinkowskiEngine problem.

zgojcic commented 3 years ago

Hi, yes, it seems that this is the same error. In the ME thread there was a comment that the code should work with ME 0.5 (even with CUDA 11.x), so maybe you can try that.

zmlshiwo commented 3 years ago

OK, thank you. I will try it.

zgojcic commented 3 years ago

It seems that this is a problem with the combination of PyTorch 1.8.x and CUDA 11.x and not an ME bug. Until it is fixed, I suggest using PyTorch 1.7.1 with CUDA 11.x. Updates can also be followed in the issue referenced in the first post of this thread.
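For anyone who wants to check quickly whether their environment matches the combination described above, here is a minimal sketch. The helper names `parse_major_minor` and `is_flagged_combination` are hypothetical and not part of this repository or of MinkowskiEngine. The check only encodes the pairing reported in this thread (PyTorch 1.8.x with CUDA 11.x) and does not detect the underlying bug itself.

```python
import re


def parse_major_minor(version):
    """Return (major, minor) from strings such as '1.8.0', '1.7.1+cu110' or '11.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_flagged_combination(torch_version, cuda_version):
    """True for PyTorch 1.8.x built against CUDA 11.x (the pairing reported in this thread)."""
    if cuda_version is None:  # CPU-only PyTorch builds report no CUDA version
        return False
    torch_is_1_8 = parse_major_minor(torch_version) == (1, 8)
    cuda_is_11 = parse_major_minor(cuda_version)[0] == 11
    return torch_is_1_8 and cuda_is_11


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        # torch.__version__ and torch.version.cuda describe the installed build.
        flagged = is_flagged_combination(torch.__version__, torch.version.cuda)
        print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
        if flagged:
            print("This matches the combination reported in this issue.")
        else:
            print("This does not match the combination reported in this issue.")
```

With the versions from the log above (Torch 1.8.0, CUDA 11.1) the check flags the environment. With PyTorch 1.7.1 it does not.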

zgojcic commented 3 years ago

Closing due to inactivity. Please open a new issue if you have further questions.

Alt216 commented 2 years ago

Hi @zgojcic, from my understanding 30-series GPUs only work with CUDA 11.1 and up, but PyTorch 1.7.1 only works with CUDA 11.0. How did you get PyTorch 1.7.1 to work with a CUDA version that is compatible with 30-series GPUs? Thanks in advance.