Open Owen-Liuyuxuan opened 3 years ago
CUDA 11 is supported, I think. Try using my distributed launch script and set the number of GPUs to 1.
Best, Xiaoyang
On Aug 20, 2021, at 11:51 AM, Yuxuan Liu @.***> wrote:
Thank you for your great contribution.
CUDA 11.0?
I managed to compile everything in a Docker container with CUDA 11.0 / PyTorch 1.7.1, including spconv (spconv showed no errors during build and install).
But right after training starts on the first step, the run ends with an error:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
I then rewrote your code for single-GPU training without distributed training (the rewritten code is in my fork repo). Everything else is the same, and it turns out to be a segmentation fault.
python3 tools/train.py --cfg configs/stereo/kitti_models/liga.3d-and-bev.yaml --launcher=none --batch_size 1
Segmentation fault (core dumped)
I have not fully investigated where it happens.
CUDA 10
I then tried using a lower CUDA version, but the 3090 only supports CUDA 11+, and the current model is too large to fit into a single 1080Ti/2080Ti (similar to DSGN?).
In my first try, I used the original launching script and it failed without any additional information.
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
I then ran it without distributed mode to find the error, and it turned out to be a segmentation fault.
epochs: 0%| | 0/60 [00:00<?, ?it/s]
{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}
filter truncated ratio: null 3d boxes [[ 2.93 -4.66 -0.73 4.18 1.86 1.48
-1.6307963]] flipped False image idx 1040 frame_id 002080
{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98} | 0/3712 [00:00<?, ?it/s]
filter truncated ratio: null 3d boxes [[ 2.93 -4.66 -0.73 4.18 1.86 1.48
-1.6307963]] flipped False image idx 1040 frame_id 002080
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
It's weird. Usually it will output more error messages. btw, did you pull the latest commit?
The error happened here:
x = self.conv_input(input_sp_tensor)
However, I did not see any error during my compilation and installation of spconv.
>>> torch.__version__
'1.7.1+cu110'
>>> torch.version.cuda
'11.0'
The possible reasons might be:
Can you do some double check?
The problem may be that my nvcc version is 11.1 while everything else is 11.0. I need nvcc 11.1+ to install mmcv-full on the 3090 (nvcc 11.0 does not support the 3090). However, PyTorch 1.7.1 does not have a cu111 prebuilt wheel, which makes this rather troublesome.
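One way to double-check the toolkit mismatch described above is to compare the CUDA release that nvcc reports with the one PyTorch was built against. The sketch below is only an illustration: the helper names `parse_nvcc_release` and `same_cuda_release` are made up for this example, and it assumes `nvcc --version` prints a line containing something like `release 11.1, V11.1.105`.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def same_cuda_release(torch_cuda, nvcc_output):
    """True when torch.version.cuda (e.g. '11.0') equals the nvcc release."""
    nvcc = parse_nvcc_release(nvcc_output)
    if nvcc is None or not torch_cuda:
        return False
    parts = torch_cuda.split(".")
    return (int(parts[0]), int(parts[1])) == nvcc


if __name__ == "__main__":
    # Both lookups are optional so the script also runs on a machine
    # without PyTorch or without the CUDA toolkit on PATH.
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    try:
        nvcc_text = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout
    except FileNotFoundError:
        nvcc_text = ""
    print("torch CUDA:", torch_cuda, "| nvcc:", parse_nvcc_release(nvcc_text))
    print("match:", same_cuda_release(torch_cuda, nvcc_text))
```

With the versions reported in this thread (torch.version.cuda '11.0', nvcc 11.1) the check would report a mismatch. A mismatch does not prove that it causes the segmentation fault; it only narrows down what to rebuild first.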
I think you can use the latest pytorch version
@Owen-Liuyuxuan Hi, have you tried the latest Pytorch/CUDA version?
Sorry I have not been working on this for a while :( and have not tried that.
Docker environment:
torch==1.9.1+cu111 torchvision==0.10.1+cu111 mmcv-full=1.2.0 nvcc==11.1.TC455_06 on a RTX 3090 server.
run command:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
+ python3 -m torch.distributed.launch --nproc_per_node=1 tools/train.py --launcher pytorch --fix_random_seed --sync_bn --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name exp_name
run command:
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py --launcher none --fix_random_seed --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name debug
It starts but still produces a segmentation fault and stops at the same point as in the original result.
Can you try run the code step by step to see which step?
Best, Xiaoyang
On Sep 28, 2021, at 3:06 PM, Yuxuan Liu @.***> wrote:
Docker environment:
torch==1.9.1+cu111 torchvision==0.10.1+cu111 mmcv-full=1.2.0 nvcc==11.1.TC455_06 on a RTX 3090 server.
run command:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
+ python3 -m torch.distributed.launch --nproc_per_node=1 tools/train.py --launcher pytorch --fix_random_seed --sync_bn --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name exp_name
It freezes with no output. After Ctrl+C, not much useful information comes out.
run command:
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py --launcher none --fix_random_seed --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name debug
It starts but still produces a segmentation fault and stops at the same point as in the original result.
I have tried that (by sync and printing along the way), and it stops here:
x = self.conv_input(input_sp_tensor)
https://github.com/xy-guo/LIGA-Stereo/blob/master/liga/models/backbones_3d_lidar/spconv_backbone.py#L385 which is a direct call to the spconv library.
I'm not sure what causes the problem. I've tested my code on a 3070 notebook and everything is fine. I'm not sure if there is a possibility that docker causes the problem?
Another suggestion: do not use --launcher none; the code only works in distributed mode.
The problem is that if the code is launched in distributed mode, I cannot get any error message (or any other training logs) and the child process just dies... I have to run in local mode to actually debug.
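When the child process dies silently under the distributed launcher, Python's standard `faulthandler` module can at least print the Python-level stack at the moment of the SIGSEGV. The following is a sketch and not part of LIGA-Stereo: it crashes a throwaway child process on purpose (the function name `forward_step` is a placeholder) to show what the dump looks like. For real training, one option would be to call `faulthandler.enable()` near the top of tools/train.py, or to export `PYTHONFAULTHANDLER=1` before running the launch script so the spawned worker inherits it.

```python
import subprocess
import sys

# Child program: enable faulthandler, then deliver SIGSEGV to itself to
# simulate a crash inside a native extension call.
CHILD_SOURCE = """
import faulthandler, os, signal
faulthandler.enable()

def forward_step():
    os.kill(os.getpid(), signal.SIGSEGV)

forward_step()
"""


def run_crashing_child():
    """Run the child interpreter and return (returncode, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, err = run_crashing_child()
    print("return code:", code)
    print(err)
```

The dump goes to stderr and names the Python frame that was active when the signal arrived, which should help tell whether the crash is inside the spconv call or somewhere earlier. It does not show the native (C++/CUDA) stack; for that a debugger such as gdb would still be needed.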
I have the same problem in a Docker container with CUDA 10.1 / PyTorch 1.6.0. Have you solved it?
Have you solved the problem? Maybe you can try using the latest commit of spconv?
I tried following your advice, but the result is the same as before. I am now on CUDA 10.2 and installed spconv with the official 'pip install spconv-cu102'. I will try it with CUDA 11.1.
Hi,
I faced this problem too. My env is: ubuntu=20.0.6, python=3.7, cuda=11.1, pytorch=1.7.1. My GPU is RTX 8000.
Command I run was: ./scripts/dist_test_ckpt.sh 1 ./configs/stereo/kitti_models/liga.3d-and-bev.yaml ./ckpt/released.final.liga.3d-and-bev.ep53.pth
Pip list is as follows:
Package Version Location
addict 2.4.0
certifi 2021.10.8
cycler 0.11.0
Cython 0.29.28
easydict 1.9
fire 0.4.0
fonttools 4.28.2
imageio 2.16.1
kiwisolver 1.3.2
liga 0.1.0+aee3731 /home/qingwu/LIGA-Stereo
llvmlite 0.38.0
matplotlib 3.5.0
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mmcv-full 1.2.0
mmdet 2.6.0 /home/qingwu/LIGA-Stereo/mmdetection_kitti
mmpycocotools 12.0.3
networkx 2.6.3
numba 0.55.1
numpy 1.21.5
opencv-python 4.5.5.64
packaging 21.3
Pillow 9.0.1
pip 21.2.2
protobuf 3.19.4
pycocotools 2.0
pyparsing 3.0.6
python-dateutil 2.8.2
PyWavelets 1.3.0
PyYAML 5.4.1
scikit-image 0.19.2
scipy 1.7.3
setuptools 58.0.4
setuptools-scm 6.3.2
six 1.16.0
spconv 1.2.1
tensorboardX 2.5
termcolor 1.1.0
terminaltables 3.1.10
tifffile 2021.11.2
tomli 1.2.2
torch 1.7.1
torchaudio 0.7.0a0+a853dff
torchvision 0.8.2
tqdm 4.63.1
typing_extensions 4.1.1
wheel 0.37.1
yapf 0.32.0
The error logs are as follows:
size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.0.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.0.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.1.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer3.1.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.1.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.2.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer3.2.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.2.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.3.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). 
size mismatch for layer3.3.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.4.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.4.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer3.4.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.5.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.5.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.0.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.0.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.1.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.2.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). unexpected key in source state_dict: fc.weight, fc.bias, layer3.0.downsample.0.weight, layer3.0.downsample.1.running_mean, layer3.0.downsample.1.running_var, layer3.0.downsample.1.weight, layer3.0.downsample.1.bias, layer4.0.downsample.0.weight, layer4.0.downsample.1.running_mean, layer4.0.downsample.1.running_var, layer4.0.downsample.1.weight, layer4.0.downsample.1.bias
2022-03-24 22:10:59,122 INFO ** Model create finished **
2022-03-24 22:10:59,123 INFO ** Load checkpoint **
2022-03-24 22:10:59,123 INFO ==> Loading parameters from checkpoint ./ckpt/released.final.liga.3d-and-bev.ep53.pth to CPU
2022-03-24 22:10:59,157 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py72af526
2022-03-24 22:11:00,163 INFO ==> Done (loaded 484/484)
2022-03-24 22:11:00,182 INFO ** Start evaluation **
2022-03-24 22:11:00,182 INFO * EPOCH 53 EVALUATION ***
eval: 0%| | 0/3769 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
Any ideas? Thanks in advance.
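The size-mismatch messages in a log like the one above are easier to read once they are pulled into a list. As a rough reading aid (not a fix), the sketch below parses such messages with a regular expression and returns the checkpoint shape and model shape for each parameter. `parse_size_mismatches` is a name invented for this example, and the pattern assumes the exact message wording shown in the log.

```python
import re

# Matches one "size mismatch for <name>: ..." message as printed in the log.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def _to_shape(text):
    """Turn '256, 128, 3, 3' into (256, 128, 3, 3)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_size_mismatches(log_text):
    """Return a list of (param_name, checkpoint_shape, model_shape)."""
    # Collapse line breaks and repeated spaces so wrapped logs still match.
    flat = " ".join(log_text.split())
    return [
        (m.group("name"), _to_shape(m.group("ckpt")), _to_shape(m.group("model")))
        for m in MISMATCH_RE.finditer(flat)
    ]


if __name__ == "__main__":
    sample = (
        "size mismatch for layer3.0.conv1.weight: copying a param with shape "
        "torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current "
        "model is torch.Size([128, 128, 3, 3])."
    )
    for name, ckpt_shape, model_shape in parse_size_mismatches(sample):
        print(name, ckpt_shape, "->", model_shape)
```

In the log above the released checkpoint is later reported as `loaded 484/484`, so these mismatches may belong to an earlier weight-loading step rather than to the released checkpoint itself; that reading is a guess from the log order, not something confirmed in this thread.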
Same fault with CUDA11.1 and pytorch==1.8.0
Hi,
I faced this problem too. My env is: ubuntu=20.0.6, python=3.7, cuda=11.1, pytorch=1.7.1. My GPU is RTX 8000.
Command I run was: ./scripts/dist_test_ckpt.sh 1 ./configs/stereo/kitti_models/liga.3d-and-bev.yaml ./ckpt/released.final.liga.3d-and-bev.ep53.pth
The error logs are as follows:
- python -m torch.distributed.launch --nproc_per_node=1 tools/test.py --launcher pytorch --save_to_file --cfg_file ./configs/stereo/kitti_models/liga.3d-and-bev.yaml --ckpt ./ckpt/released.final.liga.3d-and-bev.ep53.pth 2022-03-24 22:10:58,747 INFO **Start logging** 2022-03-24 22:10:58,747 INFO CUDA_VISIBLE_DEVICES=ALL 2022-03-24 22:10:58,747 INFO eval output dir: ckpt/released.final.liga.3d-and-bev.ep53.pth.eval/eval/epoch_53/val/default 2022-03-24 22:10:58,747 INFO total_batch_size: 1 2022-03-24 22:10:58,747 INFO cfg_file ./configs/stereo/kitti_models/liga.3d-and-bev.yaml 2022-03-24 22:10:58,747 INFO batch_size 1 2022-03-24 22:10:58,747 INFO workers 2 2022-03-24 22:10:58,747 INFO exp_name None 2022-03-24 22:10:58,747 INFO eval_tag default 2022-03-24 22:10:58,747 INFO max_waiting_mins 30 2022-03-24 22:10:58,747 INFO save_to_file True 2022-03-24 22:10:58,747 INFO ckpt ./ckpt/released.final.liga.3d-and-bev.ep53.pth 2022-03-24 22:10:58,747 INFO ckpt_id None 2022-03-24 22:10:58,747 INFO start_epoch 0 2022-03-24 22:10:58,747 INFO launcher pytorch 2022-03-24 22:10:58,747 INFO tcp_port 18888 2022-03-24 22:10:58,747 INFO local_rank 0 2022-03-24 22:10:58,747 INFO set_cfgs None 2022-03-24 22:10:58,747 INFO trainval False 2022-03-24 22:10:58,748 INFO imitation 2d 2022-03-24 22:10:58,748 INFO cfg.ROOT_DIR: /home/qingwu/LIGA-Stereo 2022-03-24 22:10:58,748 INFO cfg.LOCAL_RANK: 0 2022-03-24 22:10:58,748 INFO cfg.CLASS_NAMES: ['Car', 'Pedestrian', 'Cyclist'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATASET: StereoKittiDataset 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_PATH: ./data/kitti 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FLIP: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FORCE_FLIP: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_CLOUD_RANGE: [2, -30.4, -3, 59.6, 30.4, 1] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.VOXEL_SIZE: [0.05, 0.05, 0.1] 2022-03-24 22:10:58,748 INFO 
cfg.DATA_CONFIG.STEREO_VOXEL_SIZE: [0.2, 0.2, 0.2] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT.train: train 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT.test: val 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH.train: ['kitti_infos_train.pkl'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH.test: ['kitti_infos_val.pkl'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.USE_VAN: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.USE_PERSON_SITTING: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FOV_POINTS_ONLY: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.BOXES_GT_IN_CAM2_VIEW: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.GENERATE_CORNER_HEATMAP: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.CAT_REFLECT_DIM: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.TRAIN_DATA_AUGMENTOR: [{'NAME': 'random_crop', 'MIN_REL_X': 0, 'MAX_REL_X': 0, 'MIN_REL_Y': 1.0, 'MAX_REL_Y': 1.0, 'MAX_CROP_H': 320, 'MAX_CROP_W': 1280}, {'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.TEST_DATA_AUGMENTOR: [{'NAME': 'random_crop', 'MIN_REL_X': 0, 'MAX_REL_X': 0, 'MIN_REL_Y': 1.0, 'MAX_REL_Y': 1.0, 'MAX_CROP_H': 320, 'MAX_CROP_W': 1280}] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.encoding_type: absolute_coordinates_encoding 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.used_feature_list: ['x', 'y', 'z'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.src_feature_list: ['x', 'y', 'z'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_PROCESSOR: [{'NAME': 'mask_points_and_boxes_outside_range', 'REMOVE_OUTSIDE_BOXES': True}, {'NAME': 'transform_points_to_voxels', 
'VOXEL_SIZE': [0.05, 0.05, 0.1], 'MAX_POINTS_PER_VOXEL': 5, 'MAX_NUMBER_OF_VOXELS': {'train': 40000, 'test': 40000}}] 2022-03-24 22:10:58,749 INFO cfg.DATA_CONFIG._BASECONFIG: ./configs/stereo/dataset_configs/kitti_dataset_fused.yaml 2022-03-24 22:10:58,749 INFO cfg.MODEL = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.NAME: stereo_LIGA 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.NAME: SECONDNet 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.RETURN_BATCH_DICT: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.PRETRAINED_MODEL: ./ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.VFE = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.VFE.NAME: MeanVFE 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_3D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_3D.NAME: VoxelBackBone4xNoFinalBnReLU 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV.NAME: HeightCompression 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 160 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.NAME: HgBEVBackbone 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.num_channels: 64 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.GN: False 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.NAME: LigaBackbone 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.maxdisp: 288 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.downsample_disp: 4 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.GN: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.img_feature_attentionbydisp: True 
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.voxel_attentionbydisp: False
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.cat_img_feature: True
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.num_3dconvs: 1
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone = edict()
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.type: ResNet
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.depth: 34
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.num_stages: 4
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.out_indices: [0, 1, 2, 3]
2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.frozen_stages: -1
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg = edict()
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg.type: BN
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg.requires_grad: True
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_eval: False
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.style: pytorch
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.with_max_pool: False
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.deep_stem: False
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.block_with_final_relu: False
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.base_channels: 64
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.strides: [1, 2, 1, 1]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.dilations: [1, 1, 2, 4]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.num_channels_factor: [1, 2, 2, 2]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone_pretrained: torchvision://resnet34
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck = edict()
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.GN: True
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.in_dims: [3, 64, 128, 128, 128]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.start_level: 2
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.stereo_dim: [32, 32]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.with_upconv: True
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.cat_img_feature: True
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.sem_dim: [128, 32]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck = edict()
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.type: FPN
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.in_channels: [32]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.out_channels: 64
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.start_level: 0
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.add_extra_convs: on_output
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.num_outs: 5
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.cost_volume: [{'type': 'concat', 'downsample': 4}]
2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.cv_dim: 32
2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.rpn3d_dim: 32
2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.downsampled_depth_offset: 0.5
2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.use_stereo_out_type: feature
2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.num_hg: 1
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D = edict()
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.NAME: MMDet2DHead
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.use_3d_center: True
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg = edict()
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.type: ATSSAdvHead
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.reg_class_agnostic: False
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.seperate_extra_reg_branch: False
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.num_classes: 3
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.in_channels: 64
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.stacked_convs: 4
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.feat_channels: 64
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator = edict()
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.type: AnchorGenerator
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.ratios: [1.0]
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.octave_base_scale: 16
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.scales_per_octave: 1
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.strides: [4, 8, 16, 32, 64]
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.num_extra_reg_channel: 0
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder = edict()
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.type: DeltaXYWHBBoxCoder
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.target_means: [0.0, 0.0, 0.0, 0.0]
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.target_stds: [0.1, 0.1, 0.2, 0.2]
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls = edict()
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.type: FocalLoss
2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.use_sigmoid: True
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.gamma: 2.0
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.alpha: 0.25
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.loss_weight: 1.0
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox.type: GIoULoss
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox.loss_weight: 2.0
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.type: CrossEntropyLoss
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.use_sigmoid: True
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.loss_weight: 1.0
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner.type: ATSS3DCenterAssigner
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner.topk: 9
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.allowed_border: -1
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.pos_weight: -1
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.append_3d_centers: True
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.debug: False
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms_pre: 1000
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.min_bbox_size: 0
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.score_thr: 0.05
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms.type: nms
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms.iou_threshold: 0.6
2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.max_per_img: 100
2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.NAME: HeightCompression
2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 160
2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.SPARSE_INPUT: False
2022-03-24 22:10:58,752 INFO cfg.MODEL.BACKBONE_2D = edict()
2022-03-24 22:10:58,752 INFO cfg.MODEL.BACKBONE_2D.NAME: HgBEVBackbone
2022-03-24 22:10:58,752 INFO cfg.MODEL.BACKBONE_2D.num_channels: 64
2022-03-24 22:10:58,753 INFO cfg.MODEL.BACKBONE_2D.GN: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD = edict()
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NAME: DetHead
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NUM_CONVS: 2
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.GN: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.CLASS_AGNOSTIC: False
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.USE_DIRECTION_CLASSIFIER: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.DIR_OFFSET: 0.78539
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.DIR_LIMIT_OFFSET: 0.0
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NUM_DIR_BINS: 2
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.CLAMP_VALUE: 10.0
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.xyz_for_angles: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.hwl_for_angles: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.do_feature_imitation: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.imitation_cfg: [{'lidar_feature_layer': 'spatial_features_2d', 'stereo_feature_layer': 'spatial_features_2d', 'normalize': 'cw_scale', 'layer': 'conv2d', 'channel': 64, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}, {'lidar_feature_layer': 'volume_features', 'stereo_feature_layer': 'volume_features', 'normalize': 'cw_scale', 'layer': 'conv3d', 'channel': 32, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}]
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.ANCHOR_GENERATOR_CONFIG: [{'class_name': 'Car', 'anchor_sizes': [[3.9, 1.6, 1.56]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.78], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.6, 'unmatched_threshold': 0.45}, {'class_name': 'Pedestrian', 'anchor_sizes': [[0.8, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'Cyclist', 'anchor_sizes': [[1.76, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}]
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG = edict()
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NAME: AxisAlignedTargetAssigner
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.POS_FRACTION: -1.0
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.SAMPLE_SIZE: 512
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NORM_BY_NUM_EXAMPLES: False
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.MATCH_HEIGHT: False
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER: ResidualCoder
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG = edict()
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.div_by_diagonal: True
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.use_corners: False
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.use_tanh: False
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG = edict()
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.REG_LOSS_TYPE: WeightedSmoothL1Loss
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.IOU_LOSS_TYPE: IOU3dLoss
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.IMITATION_LOSS_TYPE: WeightedL2WithSigmaLoss
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS = edict()
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.cls_weight: 1.0
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.loc_weight: 0.5
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.dir_weight: 0.2
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.iou_weight: 1.0
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.imitation_weight: 1.0
2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.code_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD = edict()
2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.LOSS_TYPE = edict()
2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.LOSS_TYPE.ce: 1.0
2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.WEIGHTS: [1.0]
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING = edict()
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.SCORE_THRESH: 0.1
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.OUTPUT_RAW_SCORE: False
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.EVAL_METRIC: kitti
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG = edict()
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.MULTI_CLASSES_NMS: True
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_TYPE: nms_gpu
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_THRESH: 0.25
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_PRE_MAXSIZE: 4096
2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_POST_MAXSIZE: 500
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION = edict()
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.BATCH_SIZE_PER_GPU: 1
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.NUM_EPOCHS: 60
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.OPTIMIZER: adamw
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.LR: 0.001
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.WEIGHT_DECAY: 0.0001
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.MOMENTUM: 0.9
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.DIV_FACTOR: 10
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.DECAY_STEP_LIST: [50]
2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.LR_DECAY: 0.1
2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.LR_CLIP: 1e-07
2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.LR_WARMUP: True
2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.WARMUP_EPOCH: 1
2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.GRAD_NORM_CLIP: 10
2022-03-24 22:10:58,755 INFO cfg.TAG: liga.3d-and-bev
2022-03-24 22:10:58,755 INFO cfg.EXP_GROUP_PATH: configs_stereo_kitti_models
2022-03-24 22:10:58,775 INFO boxes_gt_in_cam2_view False
2022-03-24 22:10:58,775 INFO Loading KITTI dataset
2022-03-24 22:10:58,874 INFO Total samples for KITTI dataset: 3769
2022-03-24 22:10:58,874 INFO **Creating model **
2022-03-24 22:10:58,874 INFO **MODEL name is: {'NAME': 'stereo_LIGA', 'LIDAR_MODEL': {'NAME': 'SECONDNet', 'RETURN_BATCH_DICT': True, 'PRETRAINED_MODEL': './ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth', 'VFE': {'NAME': 'MeanVFE'}, 'BACKBONE_3D': {'NAME': 'VoxelBackBone4xNoFinalBnReLU'}, 'MAP_TO_BEV': {'NAME': 'HeightCompression', 'NUM_BEV_FEATURES': 160}, 'BACKBONE_2D': {'NAME': 'HgBEVBackbone', 'num_channels': 64, 'GN': False}}, 'BACKBONE_3D': {'NAME': 'LigaBackbone', 'maxdisp': 288, 'downsample_disp': 4, 'GN': True, 'img_feature_attentionbydisp': True, 'voxel_attentionbydisp': False, 'cat_img_feature': True, 'num_3dconvs': 1, 'feature_backbone': {'type': 'ResNet', 'depth': 34, 'num_stages': 4, 'out_indices': [0, 1, 2, 3], 'frozen_stages': -1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': False, 'style': 'pytorch', 
'with_max_pool': False, 'deep_stem': False, 'block_with_final_relu': False, 'base_channels': 64, 'strides': [1, 2, 1, 1], 'dilations': [1, 1, 2, 4], 'num_channels_factor': [1, 2, 2, 2]}, 'feature_backbone_pretrained': 'torchvision://resnet34', 'feature_neck': {'GN': True, 'in_dims': [3, 64, 128, 128, 128], 'start_level': 2, 'stereo_dim': [32, 32], 'with_upconv': True, 'cat_img_feature': True, 'sem_dim': [128, 32]}, 'sem_neck': {'type': 'FPN', 'in_channels': [32], 'out_channels': 64, 'start_level': 0, 'add_extra_convs': 'on_output', 'num_outs': 5}, 'cost_volume': [{'type': 'concat', 'downsample': 4}], 'cv_dim': 32, 'rpn3d_dim': 32, 'downsampled_depth_offset': 0.5, 'use_stereo_out_type': 'feature', 'num_hg': 1}, 'DENSE_HEAD_2D': {'NAME': 'MMDet2DHead', 'use_3d_center': True, 'cfg': {'type': 'ATSSAdvHead', 'reg_class_agnostic': False, 'seperate_extra_reg_branch': False, 'num_classes': 3, 'in_channels': 64, 'stacked_convs': 4, 'feat_channels': 64, 'anchor_generator': {'type': 'AnchorGenerator', 'ratios': [1.0], 'octave_base_scale': 16, 'scales_per_octave': 1, 'strides': [4, 8, 16, 32, 64]}, 'num_extra_reg_channel': 0, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.1, 0.1, 0.2, 0.2]}, 'loss_cls': {'type': 'FocalLoss', 'use_sigmoid': True, 'gamma': 2.0, 'alpha': 0.25, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'GIoULoss', 'loss_weight': 2.0}, 'loss_centerness': {'type': 'CrossEntropyLoss', 'use_sigmoid': True, 'loss_weight': 1.0}, 'train_cfg': {'assigner': {'type': 'ATSS3DCenterAssigner', 'topk': 9}, 'allowed_border': -1, 'pos_weight': -1, 'append_3d_centers': True, 'debug': False}, 'test_cfg': {'nms_pre': 1000, 'min_bbox_size': 0, 'score_thr': 0.05, 'nms': {'type': 'nms', 'iou_threshold': 0.6}, 'max_per_img': 100}}}, 'MAP_TO_BEV': {'NAME': 'HeightCompression', 'NUM_BEV_FEATURES': 160, 'SPARSE_INPUT': False}, 'BACKBONE_2D': {'NAME': 'HgBEVBackbone', 'num_channels': 64, 'GN': True}, 'DENSE_HEAD': {'NAME': 'DetHead', 
'NUM_CONVS': 2, 'GN': True, 'CLASS_AGNOSTIC': False, 'USE_DIRECTION_CLASSIFIER': True, 'DIR_OFFSET': 0.78539, 'DIR_LIMIT_OFFSET': 0.0, 'NUM_DIR_BINS': 2, 'CLAMP_VALUE': 10.0, 'xyz_for_angles': True, 'hwl_for_angles': True, 'do_feature_imitation': True, 'imitation_cfg': [{'lidar_feature_layer': 'spatial_features_2d', 'stereo_feature_layer': 'spatial_features_2d', 'normalize': 'cw_scale', 'layer': 'conv2d', 'channel': 64, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}, {'lidar_feature_layer': 'volume_features', 'stereo_feature_layer': 'volume_features', 'normalize': 'cw_scale', 'layer': 'conv3d', 'channel': 32, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}], 'ANCHOR_GENERATOR_CONFIG': [{'class_name': 'Car', 'anchor_sizes': [[3.9, 1.6, 1.56]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.78], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.6, 'unmatched_threshold': 0.45}, {'class_name': 'Pedestrian', 'anchor_sizes': [[0.8, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'Cyclist', 'anchor_sizes': [[1.76, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}], 'TARGET_ASSIGNER_CONFIG': {'NAME': 'AxisAlignedTargetAssigner', 'POS_FRACTION': -1.0, 'SAMPLE_SIZE': 512, 'NORM_BY_NUM_EXAMPLES': False, 'MATCH_HEIGHT': False, 'BOX_CODER': 'ResidualCoder', 'BOX_CODER_CONFIG': {'div_by_diagonal': True, 'use_corners': False, 'use_tanh': False}}, 'LOSS_CONFIG': {'REG_LOSS_TYPE': 'WeightedSmoothL1Loss', 'IOU_LOSS_TYPE': 'IOU3dLoss', 'IMITATION_LOSS_TYPE': 'WeightedL2WithSigmaLoss', 'LOSS_WEIGHTS': {'cls_weight': 1.0, 'loc_weight': 0.5, 'dir_weight': 0.2, 'iou_weight': 1.0, 'imitation_weight': 1.0, 'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}}, 'DEPTH_LOSS_HEAD': 
{'LOSS_TYPE': {'ce': 1.0}, 'WEIGHTS': [1.0]}, 'POST_PROCESSING': {'RECALL_THRESH_LIST': [0.3, 0.5, 0.7], 'SCORE_THRESH': 0.1, 'OUTPUT_RAW_SCORE': False, 'EVAL_METRIC': 'kitti', 'NMS_CONFIG': {'MULTI_CLASSES_NMS': True, 'NMS_TYPE': 'nms_gpu', 'NMS_THRESH': 0.25, 'NMS_PRE_MAXSIZE': 4096, 'NMS_POST_MAXSIZE': 500}}} **
2022-03-24 22:10:58,884 INFO ==> Loading parameters from checkpoint ./ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth to CPU
2022-03-24 22:10:58,897 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py60b444b
2022-03-24 22:10:58,948 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.conv_cls.weight: torch.Size([18, 64, 3, 3])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_cls.bias: torch.Size([18])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_box.weight: torch.Size([42, 64, 3, 3])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_box.bias: torch.Size([42])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_dir_cls.weight: torch.Size([12, 64, 1, 1])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_dir_cls.bias: torch.Size([12])
2022-03-24 22:10:58,950 INFO ==> Done (loaded 110/110)
stereo volume depth range: 2.0 -> 59.599998474121094, interval 0.19999999470180935
2022-03-24 22:10:59,105 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.0.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.0.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.1.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.1.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.2.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.2.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.3.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.3.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.4.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.4.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.5.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.5.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.0.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.0.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.1.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.2.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). unexpected key in source state_dict: fc.weight, fc.bias, layer3.0.downsample.0.weight, layer3.0.downsample.1.running_mean, layer3.0.downsample.1.running_var, layer3.0.downsample.1.weight, layer3.0.downsample.1.bias, layer4.0.downsample.0.weight, layer4.0.downsample.1.running_mean, layer4.0.downsample.1.running_var, layer4.0.downsample.1.weight, layer4.0.downsample.1.bias
2022-03-24 22:10:59,122 INFO ** Model create finished **
2022-03-24 22:10:59,123 INFO ** Load checkpoint **
2022-03-24 22:10:59,123 INFO ==> Loading parameters from checkpoint ./ckpt/released.final.liga.3d-and-bev.ep53.pth to CPU
2022-03-24 22:10:59,157 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py72af526
2022-03-24 22:11:00,163 INFO ==> Done (loaded 484/484)
2022-03-24 22:11:00,182 INFO ** Start evaluation **
2022-03-24 22:11:00,182 INFO * EPOCH 53 EVALUATION ***
eval: 0%| | 0/3769 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/qingwu/anaconda3/envs/liga_cuda111/bin/python', '-u', 'tools/test.py', '--local_rank=0', '--launcher', 'pytorch', '--save_to_file', '--cfg_file', './configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--ckpt', './ckpt/released.final.liga.3d-and-bev.ep53.pth']' died with <Signals.SIGSEGV: 11>.
Any ideas? Thanks in advance.
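The traceback above only shows the launcher noticing that the child process died with SIGSEGV; it does not say where the child crashed. One way to narrow that down is Python's built-in `faulthandler`, which dumps the Python-level stack when a fatal signal arrives. The following is a minimal sketch, not something taken from this repository: `CRASHING_STEP` and `forward_pass` are hypothetical stand-ins for the real training or evaluation step, and the segfault is simulated by sending SIGSEGV to the child process itself.

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for the step that crashes. It simulates a segfault
# by sending SIGSEGV to its own process; the real crash would come from a
# compiled extension instead.
CRASHING_STEP = textwrap.dedent(
    """
    import os
    import signal

    def forward_pass():
        os.kill(os.getpid(), signal.SIGSEGV)

    forward_pass()
    """
)


def run_with_faulthandler(code):
    """Run `code` in a child interpreter started with -X faulthandler."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


result = run_with_faulthandler(CRASHING_STEP)
print("return code:", result.returncode)
# stderr now contains "Fatal Python error: Segmentation fault" followed by
# the Python frames that were active when the signal arrived.
print(result.stderr)
```

For the real scripts, the same effect should be obtainable by exporting `PYTHONFAULTHANDLER=1` before calling the launch script, assuming the launcher passes its environment through to the child process. The dumped frames then show which Python call was active when the crash happened, for example a call into a compiled extension.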
Hi, have you solved this problem? I am getting the same error messages.
Same problem here with nvcc 10.1, nvidia-smi reporting CUDA 10.2, pytorch 1.6.0 + cudatoolkit 10.1, mmcv-full 1.2.1, mmdet 2.6.0, and Tesla V100s graphics cards. I've worked on it for a few days and still cannot solve it :(
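When comparing setups like the one above, it may help to collect the versions programmatically so that reports in this thread are directly comparable. Note that `nvidia-smi` reports the highest CUDA version the driver supports, not the installed toolkit, so nvcc 10.1 next to nvidia-smi 10.2 is not a mismatch by itself. What usually matters for compiled extensions is that the `nvcc` used to build them matches the CUDA version PyTorch was built with. The sketch below is only an illustration under those assumptions; the helper names are made up for this example, and the package list is a guess at what is relevant here.

```python
import re
import shutil
import subprocess
from importlib import metadata  # Python 3.8+
from typing import Dict, Iterable, Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract e.g. '10.1' from the 'release 10.1, V10.1.243' line."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def nvcc_release() -> Optional[str]:
    """Toolkit version of the nvcc on PATH, or None if nvcc is missing."""
    if shutil.which("nvcc") is None:
        return None
    proc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(proc.stdout)


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Installed version of each distribution, or None if not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def torch_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built with, or None if torch is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


print("nvcc release:      ", nvcc_release())
print("torch built w/ CUDA:", torch_cuda_version())
# Package names below are assumptions about what matters for this project.
print("packages:          ", package_versions(["torch", "mmcv-full", "mmdet", "spconv"]))
```

If `nvcc release` and `torch built w/ CUDA` disagree, rebuilding the compiled extensions against the matching toolkit is a reasonable first thing to try, although that alone may not explain the segmentation fault.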