Closed: zgojcic closed this issue 3 years ago
Hi, Zan. Nice work! I want to run your code. My GPU is a 3090, so it only supports CUDA 11. Has this problem been solved? Thank you.
Hi, yeah this is how we actually first saw the problem (with a 3090). I think that it is not solved yet but Chris is usually very fast with these things so it should be quick. I will update this issue once it is solved.
OK, great, thank you very much.
Hi, Zan. I tested the code on my computer with Ubuntu 20.04, CUDA 11.1, an RTX 3090 GPU, MinkowskiEngine 0.5.2, PyTorch 1.8, and Python 3.7. When I run the training code, it shows the following error.
python train.py ./configs/train/train_fully_supervised.yaml
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine/__init__.py:42: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
  "It is recommended to set it below 24.",
Using /home/ps/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ps/.cache/torch_extensions/cd/build.ninja...
Building extension module cd...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cd...
2021-03-29 10:21:52 zml root[103079] INFO Command: train.py ./configs/train/train_fully_supervised.yaml
2021-03-29 10:21:52 zml root[103079] INFO Arguments: method_backbone: ME, method_flow: True, method_ego_motion: False, method_semantic: False, method_clustering: False, misc_voxel_size: 0.1, misc_num_points: 8192, misc_trainer: FlowTrainer, misc_use_gpu: True, misc_log_dir: ./logs/, misc_run_mode: train, data_input_features: absolute_coords, data_only_near_points: True, data_dataset: FlyingThings3D_ME, data_root: /media/ps/data/rigid_scene_flow_dataset/flying_things_3d/, data_remove_ground: False, data_augment_data: False, train_batch_size: 8, train_acc_iter_size: 1, train_num_workers: 6, train_max_epoch: 50, train_stat_interval: 5, train_chkpt_interval: 40, train_val_interval: 20, train_weighted_seg_loss: True, val_batch_size: 8, val_num_workers: 6, test_results_dir: ./eval/, test_batch_size: 1, test_num_workers: 1, loss_bg_loss_w: 1.0, loss_fg_loss_w: 1.0, loss_flow_loss_w: 1.0, loss_ego_loss_w: 1.0, loss_inlier_loss_w: 0.005, loss_cd_loss_w: 0.5, loss_rigid_loss_w: 1.0, loss_background_loss: False, loss_flow_loss: True, loss_ego_loss: False, loss_foreground_loss: False, optimizer_alg: Adam, optimizer_learning_rate: 0.001, optimizer_weight_decay: 0.0, optimizer_momentum: 0.8, optimizer_scheduler: ExponentialLR, optimizer_exp_gamma: 0.98, network_normalize_features: True, network_norm_type: IN, network_in_kernel_size: 7, network_feature_dim: 64, network_use_pretrained: True, network_pretrained_path: , metrics_flow: True, metrics_ego_motion: False, metrics_semantic: False
2021-03-29 10:21:52 zml root[103079] INFO Output and logs will be saved to ./logs/logs_FlyingThings3D_ME/21_03_29-10_21_52_431116Method_ME_FlowVoxSize_0.1__Pts_8192
2021-03-29 10:21:52 zml root[103079] INFO Parameter Count: 8073729
2021-03-29 10:21:52 zml root[103079] INFO Torch version: 1.8.0
2021-03-29 10:21:52 zml root[103079] INFO CUDA version: 11.1
2021-03-29 10:21:52 zml root[103079] INFO Training epoch: 0, LR: [0.001]
0%| | 0/1963 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 243, in <module>
    main(cfg, args.config)
  File "train.py", line 136, in main
    losses, metrics, total_loss = trainer.train_step(batch)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 45, in train_step
    losses, metrics = self._compute_loss_metrics(data)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 152, in _compute_loss_metrics
    inferred_values = self.model(sinput1, sinput2, input_dict['pcd_eval_s'], input_dict['pcd_eval_t'], input_dict['fg_labels_s'], input_dict['fg_labels_t'])
  File "/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 308, in forward
    self._infer_flow(dec_feat_1, dec_feat_2)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 102, in _infer_flow
    feat_s = flow_f_1.F[flow_f_1.C[:,0] == b_idx]
RuntimeError: invalid shape dimension -1122459258
0%| | 0/1963 [00:02<?, ?it/s]
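For readers unfamiliar with the line that fails in the traceback, `feat_s = flow_f_1.F[flow_f_1.C[:,0] == b_idx]` selects the feature rows that belong to one batch element, using the first column of the coordinates `C` as the batch index. The pure-Python sketch below mimics that selection on plain lists. The function name and the toy data are made up for illustration; this is not the repository's code and it does not reproduce the error.

```python
def select_batch_features(coords, feats, b_idx):
    """Return the feature rows whose coordinate row has batch index b_idx.

    Plain-list analogue of `F[C[:, 0] == b_idx]`; illustrative only.
    """
    # Boolean mask: True where the first coordinate column equals b_idx.
    mask = [row[0] == b_idx for row in coords]
    # Keep only the feature rows where the mask is True.
    return [f for f, keep in zip(feats, mask) if keep]


# Toy data (hypothetical): three points, two of them in batch element 0.
coords = [[0, 1, 2, 3], [1, 4, 5, 6], [0, 7, 8, 9]]
feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

print(select_batch_features(coords, feats, 0))  # [[0.1, 0.2], [0.5, 0.6]]
```

The number of selected rows is a count and can never be negative, so a reported dimension of -1122459258 suggests the problem lies below the model code rather than in this indexing expression itself.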
Yes, it shows the invalid shape dimension error. I also suspect this is a MinkowskiEngine problem.
Hey, yes, it seems that this is the same error. In the ME thread there was a comment that the code should work with ME 0.5 (even with CUDA 11.x), so maybe you can try that.
OK, thank you. I will try it.
It seems that this is a problem with the combination of PyTorch 1.8.x and CUDA 11.x and not an ME bug. Until it is fixed, I suggest using PyTorch 1.7.1 with CUDA 11.x. Updates can also be followed in the issue referenced in the first post of this thread.
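A quick way to tell whether an environment matches the combination described above is to compare the two version strings. The following is a minimal sketch, not part of the repository: the helper name is hypothetical, and the version strings are passed in as plain arguments so the check runs without PyTorch installed. In a real environment they would typically come from `torch.__version__` and `torch.version.cuda`.

```python
def is_reported_problem_combo(torch_version, cuda_version):
    """True for PyTorch 1.8.x together with CUDA 11.x, as reported in this thread.

    Hypothetical helper; it only compares version strings.
    """
    # Drop a local build suffix such as "+cu111" before splitting.
    torch_parts = torch_version.split("+")[0].split(".")
    # cuda_version may be None or empty for CPU-only builds.
    cuda_major = cuda_version.split(".")[0] if cuda_version else ""
    return torch_parts[:2] == ["1", "8"] and cuda_major == "11"


print(is_reported_problem_combo("1.8.0", "11.1"))  # True
print(is_reported_problem_combo("1.7.1", "11.0"))  # False
```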
Closing due to inactivity. Please open a new issue if you have further questions.
Hi @zgojcic, from my understanding 30-series GPUs only work with CUDA 11.1 and up, but PyTorch 1.7.1 only works with CUDA 11.0. How did you get PyTorch 1.7.1 to work with a CUDA version that is compatible with 30-series GPUs? Thanks in advance.
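The question above is not answered in this thread. One thing that may help, offered as an assumption rather than as the maintainers' answer: under the CUDA binary-compatibility rule, code compiled for compute capability X.y also runs on devices of capability X.z with z ≥ y, so a build compiled for sm_80 may still run on an RTX 3090 (capability 8.6). The sketch below applies that rule to an architecture list. The helper name and the example lists are hypothetical; in a real environment the inputs would come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`.

```python
import re


def build_covers_device(arch_list, capability):
    """Check whether any compiled architecture can run on the given device.

    arch_list:  strings such as "sm_80" or "compute_80"
    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090
    """
    major, minor = capability
    for arch in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if m is None:
            continue  # skip entries this sketch does not understand
        kind, a_major, a_minor = m.group(1), int(m.group(2)), int(m.group(3))
        # Binary code (sm_XY) runs on the same major version with minor >= Y.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX (compute_XY) can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


# Hypothetical architecture lists, for illustration only.
with_sm80 = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80"]
without_sm80 = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]

print(build_covers_device(with_sm80, (8, 6)))     # True
print(build_covers_device(without_sm80, (8, 6)))  # False
```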
When using CUDA 11, our model returns the following error:
It seems that this is due to the combination of CUDA 11 with MinkowskiEngine. The issue is currently under investigation at https://github.com/NVIDIA/MinkowskiEngine/issues/330
Until it is solved, we suggest using CUDA 10.2 or 10.1.