xy-guo / LIGA-Stereo

Code for LIGA-Stereo Detector, ICCV'21
Apache License 2.0

Segmentation fault on CUDA 11.0/torch 1.7.1 #1

Open Owen-Liuyuxuan opened 3 years ago

Owen-Liuyuxuan commented 3 years ago

Thank you for your great contribution.

CUDA 11.0?

I did manage to compile everything in a Docker container with CUDA 11.0/PyTorch 1.7.1, including spconv (spconv showed no errors during build and install).

But after training starts, the code dies at the first step with this error:

CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml

subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.

I then rewrote your code for single-GPU training without distributed training (the rewritten code is in my fork repo). Everything looks the same, and it again ends with a segmentation fault.

python3 tools/train.py --cfg configs/stereo/kitti_models/liga.3d-and-bev.yaml --launcher=none --batch_size 1

Segmentation fault (core dumped) 

I have not fully investigated where it happens.
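
One generic way to at least see the Python-side stack at the crash (a sketch, not specific to LIGA) is the standard-library faulthandler module:

# Generic debugging sketch: make a segfault print the Python-level stack
# of every thread before the process dies.
import faulthandler
faulthandler.enable()          # dumps stacks on SIGSEGV/SIGABRT/SIGBUS/SIGILL

# Equivalent without touching any code:
#   PYTHONFAULTHANDLER=1 python3 tools/train.py ...
#   python3 -X faulthandler tools/train.py ...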

CUDA 10

I then tried using a lower CUDA version, but the 3090 only supports CUDA 11+, and the current model is too large to fit on a single 1080Ti/2080Ti (similar to DSGN?).

xy-guo commented 3 years ago

CUDA 11 should be supported, I think. Try using my distributed launching script and set the number of GPUs to 1.

Best, Xiaoyang


Owen-Liuyuxuan commented 3 years ago

In my first try, I used the original launching script and it failed without any additional information.

CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml

subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.

I then ran without distributed training because I wanted to find the error, and it turned out to be a segmentation fault.

Owen-Liuyuxuan commented 3 years ago

epochs:   0%| | 0/60 [00:00<?, ?it/s]
{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}
filter truncated ratio: null 3d boxes [[ 2.93 -4.66 -0.73 4.18 1.86 1.48 -1.6307963]] flipped False image idx 1040 frame_id 002080
{'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98} | 0/3712 [00:00<?, ?it/s]
filter truncated ratio: null 3d boxes [[ 2.93 -4.66 -0.73 4.18 1.86 1.48 -1.6307963]] flipped False image idx 1040 frame_id 002080

/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
xy-guo commented 3 years ago

It's weird. Usually it outputs more error messages. BTW, did you pull the latest commit?

Owen-Liuyuxuan commented 3 years ago

The error happens here:

x = self.conv_input(input_sp_tensor)

However, I did not see any error during my compilation and installation of spconv.

>>> torch.__version__
'1.7.1+cu110'
>>> torch.version.cuda
'11.0'
xy-guo commented 3 years ago

The possible reasons might be:

Can you double-check these?

Owen-Liuyuxuan commented 3 years ago

The problem may be that my nvcc version is 11.1 while everything else is 11.0. I need nvcc 11.1+ to install mmcv-full for the 3090 (nvcc 11.0 does not support the 3090). However, PyTorch 1.7.1 does not have a cu111 prebuilt wheel. It is rather troublesome.
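
For reference, a quick way to see the mismatch in one place (a diagnostic sketch only; GPU index 0 and nvcc on PATH are assumptions):

# Compare the CUDA version PyTorch was built for with the local nvcc, and
# show the GPU's compute capability (an RTX 3090 is sm_86).
import subprocess
import torch

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)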

xy-guo commented 3 years ago

I think you can use the latest PyTorch version.

fengziyue commented 3 years ago

@Owen-Liuyuxuan Hi, have you tried the latest Pytorch/CUDA version?

Owen-Liuyuxuan commented 3 years ago

@Owen-Liuyuxuan Hi, have you tried the latest Pytorch/CUDA version?

Sorry I have not been working on this for a while :( and have not tried that.

Owen-Liuyuxuan commented 3 years ago

Docker environment:

torch==1.9.1+cu111 torchvision==0.10.1+cu111 mmcv-full=1.2.0 nvcc==11.1.TC455_06 on a RTX 3090 server.

run command:

CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
+ python3 -m torch.distributed.launch --nproc_per_node=1 tools/train.py --launcher pytorch --fix_random_seed --sync_bn --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name exp_name

It freezes with no output; Ctrl+C does not give much useful information.

run command:

CUDA_VISIBLE_DEVICES=0 python3 tools/train.py --launcher none --fix_random_seed --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name debug

It starts but still produces a segmentation fault, stopping at the same place as the original run.

xy-guo commented 3 years ago

Can you try running the code step by step to see which step fails?

Best, Xiaoyang


Owen-Liuyuxuan commented 3 years ago

I have tried that (by synchronizing and printing along the way), and it stops here:

x = self.conv_input(input_sp_tensor)

https://github.com/xy-guo/LIGA-Stereo/blob/master/liga/models/backbones_3d_lidar/spconv_backbone.py#L385 which is a direct call to the spconv library.
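
To check whether spconv itself is at fault, an isolated smoke test along these lines might help; this is only a sketch, assuming the spconv 1.x API that this setup uses (spconv 1.2.1), with arbitrary shapes and channels:

# Minimal spconv 1.x sanity check, independent of LIGA's backbone. If this
# also segfaults, the problem is in the spconv build / CUDA setup rather than
# in this repo.
import torch
import spconv

# three voxels in a 40x40x40 grid; columns are (batch_idx, z, y, x), int32
indices = torch.tensor([[0, 1, 1, 1], [0, 2, 2, 2], [0, 3, 3, 3]], dtype=torch.int32).cuda()
features = torch.randn(3, 4).cuda()
x = spconv.SparseConvTensor(features, indices, spatial_shape=[40, 40, 40], batch_size=1)

conv = spconv.SparseSequential(
    spconv.SubMConv3d(4, 16, 3, padding=1, bias=False, indice_key="subm1"),
).cuda()

out = conv(x)
torch.cuda.synchronize()
print("spconv ok, output feature shape:", out.features.shape)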

xy-guo commented 3 years ago

I'm not sure what causes the problem. I've tested my code on a 3070 laptop and everything is fine. Could it be that Docker causes the problem?

xy-guo commented 3 years ago

Another suggestion: do not use --launcher none; the code only works in distributed mode.

Owen-Liuyuxuan commented 3 years ago

Another suggestion: do not use --launcher none; the code only works in distributed mode.

The problem is that if the code is launched in distributed mode, I cannot get any error messages (or any other training logs) and the child process just dies... I have to run in local mode to actually debug.

Xie-PC commented 2 years ago

I have the same problem in a Docker container with CUDA 10.1/PyTorch 1.6.0. Have you solved it?

xy-guo commented 2 years ago

Have you solved the problem? Maybe you can try using the latest commit of spconv?

Xie-PC commented 2 years ago

Have you solved the problem? Maybe you can try using the latest commit of spconv?

I have tried following your advice, but it is still the same as before. My CUDA version is now 10.2, with spconv installed via the official 'pip install spconv-cu102'. I will try CUDA 11.1 next.
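
One thing worth double-checking (a general spconv note, not something from this repo): the pip "spconv-cuXXX" wheels are spconv 2.x, whose layers live under spconv.pytorch, while this code base was, as far as I can tell, written against the spconv 1.x API (spconv 1.2.1 built from source). A quick way to tell which layout is installed:

# Sketch: distinguish spconv 1.x (top-level layers) from spconv 2.x
# (layers under spconv.pytorch).
import spconv
print("spconv version:", getattr(spconv, "__version__", "unknown"))
try:
    from spconv.pytorch import SubMConv3d   # spconv 2.x layout
    print("spconv 2.x detected; 1.x-style calls (spconv.SubMConv3d, ...) will not match")
except ImportError:
    from spconv import SubMConv3d           # spconv 1.x layout
    print("spconv 1.x layout detected")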

BitandPoly commented 2 years ago

Hi,

I faced this problem too. My env is: ubuntu=20.0.6, python=3.7, cuda=11.1, pytorch=1.7.1. My GPU is RTX 8000.

The command I ran was: ./scripts/dist_test_ckpt.sh 1 ./configs/stereo/kitti_models/liga.3d-and-bev.yaml ./ckpt/released.final.liga.3d-and-bev.ep53.pth

Pip list is as follows:

Package            Version          Location
addict             2.4.0
certifi            2021.10.8
cycler             0.11.0
Cython             0.29.28
easydict           1.9
fire               0.4.0
fonttools          4.28.2
imageio            2.16.1
kiwisolver         1.3.2
liga               0.1.0+aee3731    /home/qingwu/LIGA-Stereo
llvmlite           0.38.0
matplotlib         3.5.0
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
mmcv-full          1.2.0
mmdet              2.6.0            /home/qingwu/LIGA-Stereo/mmdetection_kitti
mmpycocotools      12.0.3
networkx           2.6.3
numba              0.55.1
numpy              1.21.5
opencv-python      4.5.5.64
packaging          21.3
Pillow             9.0.1
pip                21.2.2
protobuf           3.19.4
pycocotools        2.0
pyparsing          3.0.6
python-dateutil    2.8.2
PyWavelets         1.3.0
PyYAML             5.4.1
scikit-image       0.19.2
scipy              1.7.3
setuptools         58.0.4
setuptools-scm     6.3.2
six                1.16.0
spconv             1.2.1
tensorboardX       2.5
termcolor          1.1.0
terminaltables     3.1.10
tifffile           2021.11.2
tomli              1.2.2
torch              1.7.1
torchaudio         0.7.0a0+a853dff
torchvision        0.8.2
tqdm               4.63.1
typing_extensions  4.1.1
wheel              0.37.1
yapf               0.32.0

The error logs are as follows:

size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.0.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.0.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.0.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.1.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.1.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.1.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.2.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer3.2.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.2.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.2.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.3.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.3.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.3.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.4.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer3.4.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.4.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.4.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.5.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer3.5.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer3.5.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.0.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). 
size mismatch for layer4.0.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.0.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.1.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.1.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]). size mismatch for layer4.2.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). 
size mismatch for layer4.2.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). size mismatch for layer4.2.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]). unexpected key in source state_dict: fc.weight, fc.bias, layer3.0.downsample.0.weight, layer3.0.downsample.1.running_mean, layer3.0.downsample.1.running_var, layer3.0.downsample.1.weight, layer3.0.downsample.1.bias, layer4.0.downsample.0.weight, layer4.0.downsample.1.running_mean, layer4.0.downsample.1.running_var, layer4.0.downsample.1.weight, layer4.0.downsample.1.bias

2022-03-24 22:10:59,122 INFO ** Model create finished ** 2022-03-24 22:10:59,123 INFO ** Load checkpoint ** 2022-03-24 22:10:59,123 INFO ==> Loading parameters from checkpoint ./ckpt/released.final.liga.3d-and-bev.ep53.pth to CPU 2022-03-24 22:10:59,157 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py72af526 2022-03-24 22:11:00,163 INFO ==> Done (loaded 484/484) 2022-03-24 22:11:00,182 INFO ** Start evaluation ** 2022-03-24 22:11:00,182 INFO * EPOCH 53 EVALUATION *** eval: 0%| | 0/3769 [00:00<?, ?it/s]Traceback (most recent call last): File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in main() File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main cmd=cmd) subprocess.CalledProcessError: Command '['/home/qingwu/anaconda3/envs/liga_cuda111/bin/python', '-u', 'tools/test.py', '--local_rank=0', '--launcher', 'pytorch', '--save_to_file', '--cfg_file', './configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--ckpt', './ckpt/released.final.liga.3d-and-bev.ep53.pth']' died with <Signals.SIGSEGV: 11>.

Any ideas? Thanks in advance.

Cheren15 commented 2 years ago

Same fault with CUDA 11.1 and pytorch==1.8.0.

SibylGao commented 2 years ago


The error logs are as follows:

  • python -m torch.distributed.launch --nproc_per_node=1 tools/test.py --launcher pytorch --save_to_file --cfg_file ./configs/stereo/kitti_models/liga.3d-and-bev.yaml --ckpt ./ckpt/released.final.liga.3d-and-bev.ep53.pth 2022-03-24 22:10:58,747 INFO **Start logging** 2022-03-24 22:10:58,747 INFO CUDA_VISIBLE_DEVICES=ALL 2022-03-24 22:10:58,747 INFO eval output dir: ckpt/released.final.liga.3d-and-bev.ep53.pth.eval/eval/epoch_53/val/default 2022-03-24 22:10:58,747 INFO total_batch_size: 1 2022-03-24 22:10:58,747 INFO cfg_file ./configs/stereo/kitti_models/liga.3d-and-bev.yaml 2022-03-24 22:10:58,747 INFO batch_size 1 2022-03-24 22:10:58,747 INFO workers 2 2022-03-24 22:10:58,747 INFO exp_name None 2022-03-24 22:10:58,747 INFO eval_tag default 2022-03-24 22:10:58,747 INFO max_waiting_mins 30 2022-03-24 22:10:58,747 INFO save_to_file True 2022-03-24 22:10:58,747 INFO ckpt ./ckpt/released.final.liga.3d-and-bev.ep53.pth 2022-03-24 22:10:58,747 INFO ckpt_id None 2022-03-24 22:10:58,747 INFO start_epoch 0 2022-03-24 22:10:58,747 INFO launcher pytorch 2022-03-24 22:10:58,747 INFO tcp_port 18888 2022-03-24 22:10:58,747 INFO local_rank 0 2022-03-24 22:10:58,747 INFO set_cfgs None 2022-03-24 22:10:58,747 INFO trainval False 2022-03-24 22:10:58,748 INFO imitation 2d 2022-03-24 22:10:58,748 INFO cfg.ROOT_DIR: /home/qingwu/LIGA-Stereo 2022-03-24 22:10:58,748 INFO cfg.LOCAL_RANK: 0 2022-03-24 22:10:58,748 INFO cfg.CLASS_NAMES: ['Car', 'Pedestrian', 'Cyclist'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATASET: StereoKittiDataset 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_PATH: ./data/kitti 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FLIP: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FORCE_FLIP: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_CLOUD_RANGE: [2, -30.4, -3, 59.6, 30.4, 1] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.VOXEL_SIZE: [0.05, 0.05, 0.1] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.STEREO_VOXEL_SIZE: [0.2, 0.2, 0.2] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT.train: train 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_SPLIT.test: val 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH = edict() 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH.train: ['kitti_infos_train.pkl'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.INFO_PATH.test: ['kitti_infos_val.pkl'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.USE_VAN: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.USE_PERSON_SITTING: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.FOV_POINTS_ONLY: True 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.BOXES_GT_IN_CAM2_VIEW: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.GENERATE_CORNER_HEATMAP: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.CAT_REFLECT_DIM: False 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.TRAIN_DATA_AUGMENTOR: [{'NAME': 'random_crop', 'MIN_REL_X': 0, 'MAX_REL_X': 0, 'MIN_REL_Y': 1.0, 'MAX_REL_Y': 1.0, 'MAX_CROP_H': 320, 'MAX_CROP_W': 1280}, {'NAME': 'filter_truncated', 'AREA_RATIO_THRESH': None, 'AREA_2D_RATIO_THRESH': None, 'GT_TRUNCATED_THRESH': 0.98}] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.TEST_DATA_AUGMENTOR: [{'NAME': 'random_crop', 'MIN_REL_X': 0, 'MAX_REL_X': 0, 'MIN_REL_Y': 1.0, 'MAX_REL_Y': 1.0, 'MAX_CROP_H': 320, 'MAX_CROP_W': 1280}] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING = edict() 2022-03-24 22:10:58,748 INFO 
cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.encoding_type: absolute_coordinates_encoding 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.used_feature_list: ['x', 'y', 'z'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.src_feature_list: ['x', 'y', 'z'] 2022-03-24 22:10:58,748 INFO cfg.DATA_CONFIG.DATA_PROCESSOR: [{'NAME': 'mask_points_and_boxes_outside_range', 'REMOVE_OUTSIDE_BOXES': True}, {'NAME': 'transform_points_to_voxels', 'VOXEL_SIZE': [0.05, 0.05, 0.1], 'MAX_POINTS_PER_VOXEL': 5, 'MAX_NUMBER_OF_VOXELS': {'train': 40000, 'test': 40000}}] 2022-03-24 22:10:58,749 INFO cfg.DATA_CONFIG._BASECONFIG: ./configs/stereo/dataset_configs/kitti_dataset_fused.yaml 2022-03-24 22:10:58,749 INFO cfg.MODEL = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.NAME: stereo_LIGA 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.NAME: SECONDNet 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.RETURN_BATCH_DICT: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.PRETRAINED_MODEL: ./ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.VFE = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.VFE.NAME: MeanVFE 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_3D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_3D.NAME: VoxelBackBone4xNoFinalBnReLU 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV.NAME: HeightCompression 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 160 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.NAME: HgBEVBackbone 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.num_channels: 64 2022-03-24 22:10:58,749 INFO cfg.MODEL.LIDAR_MODEL.BACKBONE_2D.GN: False 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.NAME: LigaBackbone 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.maxdisp: 288 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.downsample_disp: 4 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.GN: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.img_feature_attentionbydisp: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.voxel_attentionbydisp: False 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.cat_img_feature: True 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.num_3dconvs: 1 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone = edict() 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.type: ResNet 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.depth: 34 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.num_stages: 4 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.out_indices: [0, 1, 2, 3] 2022-03-24 22:10:58,749 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.frozen_stages: -1 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg = edict() 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg.type: BN 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_cfg.requires_grad: True 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.norm_eval: False 2022-03-24 
22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.style: pytorch 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.with_max_pool: False 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.deep_stem: False 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.block_with_final_relu: False 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.base_channels: 64 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.strides: [1, 2, 1, 1] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.dilations: [1, 1, 2, 4] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone.num_channels_factor: [1, 2, 2, 2] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_backbone_pretrained: torchvision://resnet34 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck = edict() 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.GN: True 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.in_dims: [3, 64, 128, 128, 128] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.start_level: 2 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.stereo_dim: [32, 32] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.with_upconv: True 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.cat_img_feature: True 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.feature_neck.sem_dim: [128, 32] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck = edict() 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.type: FPN 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.in_channels: [32] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.out_channels: 64 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.start_level: 0 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.add_extra_convs: on_output 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.sem_neck.num_outs: 5 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.cost_volume: [{'type': 'concat', 'downsample': 4}] 2022-03-24 22:10:58,750 INFO cfg.MODEL.BACKBONE_3D.cv_dim: 32 2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.rpn3d_dim: 32 2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.downsampled_depth_offset: 0.5 2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.use_stereo_out_type: feature 2022-03-24 22:10:58,751 INFO cfg.MODEL.BACKBONE_3D.num_hg: 1 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D = edict() 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.NAME: MMDet2DHead 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.use_3d_center: True 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg = edict() 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.type: ATSSAdvHead 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.reg_class_agnostic: False 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.seperate_extra_reg_branch: False 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.num_classes: 3 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.in_channels: 64 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.stacked_convs: 4 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.feat_channels: 64 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator = edict() 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.type: AnchorGenerator 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.ratios: [1.0] 2022-03-24 
22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.octave_base_scale: 16 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.scales_per_octave: 1 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.anchor_generator.strides: [4, 8, 16, 32, 64] 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.num_extra_reg_channel: 0 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder = edict() 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.type: DeltaXYWHBBoxCoder 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.target_means: [0.0, 0.0, 0.0, 0.0] 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.bbox_coder.target_stds: [0.1, 0.1, 0.2, 0.2] 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls = edict() 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.type: FocalLoss 2022-03-24 22:10:58,751 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.use_sigmoid: True 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.gamma: 2.0 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.alpha: 0.25 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_cls.loss_weight: 1.0 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox.type: GIoULoss 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_bbox.loss_weight: 2.0 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.type: CrossEntropyLoss 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.use_sigmoid: True 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.loss_centerness.loss_weight: 1.0 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner.type: ATSS3DCenterAssigner 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.assigner.topk: 9 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.allowed_border: -1 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.pos_weight: -1 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.append_3d_centers: True 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.train_cfg.debug: False 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms_pre: 1000 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.min_bbox_size: 0 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.score_thr: 0.05 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms.type: nms 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.nms.iou_threshold: 0.6 2022-03-24 22:10:58,752 INFO cfg.MODEL.DENSE_HEAD_2D.cfg.test_cfg.max_per_img: 100 2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV = edict() 2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.NAME: HeightCompression 2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 160 2022-03-24 22:10:58,752 INFO cfg.MODEL.MAP_TO_BEV.SPARSE_INPUT: False 2022-03-24 22:10:58,752 INFO cfg.MODEL.BACKBONE_2D = edict() 2022-03-24 22:10:58,752 INFO 
cfg.MODEL.BACKBONE_2D.NAME: HgBEVBackbone 2022-03-24 22:10:58,752 INFO cfg.MODEL.BACKBONE_2D.num_channels: 64 2022-03-24 22:10:58,753 INFO cfg.MODEL.BACKBONE_2D.GN: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD = edict() 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NAME: DetHead 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NUM_CONVS: 2 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.GN: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.CLASS_AGNOSTIC: False 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.USE_DIRECTION_CLASSIFIER: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.DIR_OFFSET: 0.78539 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.DIR_LIMIT_OFFSET: 0.0 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.NUM_DIR_BINS: 2 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.CLAMP_VALUE: 10.0 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.xyz_for_angles: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.hwl_for_angles: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.do_feature_imitation: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.imitation_cfg: [{'lidar_feature_layer': 'spatial_features_2d', 'stereo_feature_layer': 'spatial_features_2d', 'normalize': 'cw_scale', 'layer': 'conv2d', 'channel': 64, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}, {'lidar_feature_layer': 'volume_features', 'stereo_feature_layer': 'volume_features', 'normalize': 'cw_scale', 'layer': 'conv3d', 'channel': 32, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}] 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.ANCHOR_GENERATOR_CONFIG: [{'class_name': 'Car', 'anchor_sizes': [[3.9, 1.6, 1.56]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.78], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.6, 'unmatched_threshold': 0.45}, {'class_name': 'Pedestrian', 'anchor_sizes': [[0.8, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'Cyclist', 'anchor_sizes': [[1.76, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}] 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG = edict() 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NAME: AxisAlignedTargetAssigner 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.POS_FRACTION: -1.0 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.SAMPLE_SIZE: 512 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NORM_BY_NUM_EXAMPLES: False 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.MATCH_HEIGHT: False 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER: ResidualCoder 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG = edict() 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.div_by_diagonal: True 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.use_corners: False 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.use_tanh: False 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG = edict() 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.REG_LOSS_TYPE: WeightedSmoothL1Loss 
2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.IOU_LOSS_TYPE: IOU3dLoss 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.IMITATION_LOSS_TYPE: WeightedL2WithSigmaLoss 2022-03-24 22:10:58,753 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS = edict() 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.cls_weight: 1.0 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.loc_weight: 0.5 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.dir_weight: 0.2 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.iou_weight: 1.0 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.imitation_weight: 1.0 2022-03-24 22:10:58,754 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.code_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] 2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD = edict() 2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.LOSS_TYPE = edict() 2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.LOSS_TYPE.ce: 1.0 2022-03-24 22:10:58,754 INFO cfg.MODEL.DEPTH_LOSS_HEAD.WEIGHTS: [1.0] 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING = edict() 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.RECALL_THRESH_LIST: [0.3, 0.5, 0.7] 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.SCORE_THRESH: 0.1 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.OUTPUT_RAW_SCORE: False 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.EVAL_METRIC: kitti 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG = edict() 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.MULTI_CLASSES_NMS: True 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_TYPE: nms_gpu 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_THRESH: 0.25 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_PRE_MAXSIZE: 4096 2022-03-24 22:10:58,754 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_POST_MAXSIZE: 500 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION = edict() 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.BATCH_SIZE_PER_GPU: 1 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.NUM_EPOCHS: 60 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.OPTIMIZER: adamw 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.LR: 0.001 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.WEIGHT_DECAY: 0.0001 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.MOMENTUM: 0.9 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.DIV_FACTOR: 10 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.DECAY_STEP_LIST: [50] 2022-03-24 22:10:58,754 INFO cfg.OPTIMIZATION.LR_DECAY: 0.1 2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.LR_CLIP: 1e-07 2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.LR_WARMUP: True 2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.WARMUP_EPOCH: 1 2022-03-24 22:10:58,755 INFO cfg.OPTIMIZATION.GRAD_NORM_CLIP: 10 2022-03-24 22:10:58,755 INFO cfg.TAG: liga.3d-and-bev 2022-03-24 22:10:58,755 INFO cfg.EXP_GROUP_PATH: configs_stereo_kitti_models 2022-03-24 22:10:58,775 INFO boxes_gt_in_cam2_view False 2022-03-24 22:10:58,775 INFO Loading KITTI dataset 2022-03-24 22:10:58,874 INFO Total samples for KITTI dataset: 3769 2022-03-24 22:10:58,874 INFO **Creating model ** 2022-03-24 22:10:58,874 INFO **MODEL name is: {'NAME': 'stereo_LIGA', 'LIDAR_MODEL': {'NAME': 'SECONDNet', 'RETURN_BATCH_DICT': True, 'PRETRAINED_MODEL': './ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth', 'VFE': {'NAME': 
'MeanVFE'}, 'BACKBONE_3D': {'NAME': 'VoxelBackBone4xNoFinalBnReLU'}, 'MAP_TO_BEV': {'NAME': 'HeightCompression', 'NUM_BEV_FEATURES': 160}, 'BACKBONE_2D': {'NAME': 'HgBEVBackbone', 'num_channels': 64, 'GN': False}}, 'BACKBONE_3D': {'NAME': 'LigaBackbone', 'maxdisp': 288, 'downsample_disp': 4, 'GN': True, 'img_feature_attentionbydisp': True, 'voxel_attentionbydisp': False, 'cat_img_feature': True, 'num_3dconvs': 1, 'feature_backbone': {'type': 'ResNet', 'depth': 34, 'num_stages': 4, 'out_indices': [0, 1, 2, 3], 'frozen_stages': -1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': False, 'style': 'pytorch', 'with_max_pool': False, 'deep_stem': False, 'block_with_final_relu': False, 'base_channels': 64, 'strides': [1, 2, 1, 1], 'dilations': [1, 1, 2, 4], 'num_channels_factor': [1, 2, 2, 2]}, 'feature_backbone_pretrained': 'torchvision://resnet34', 'feature_neck': {'GN': True, 'in_dims': [3, 64, 128, 128, 128], 'start_level': 2, 'stereo_dim': [32, 32], 'with_upconv': True, 'cat_img_feature': True, 'sem_dim': [128, 32]}, 'sem_neck': {'type': 'FPN', 'in_channels': [32], 'out_channels': 64, 'start_level': 0, 'add_extra_convs': 'on_output', 'num_outs': 5}, 'cost_volume': [{'type': 'concat', 'downsample': 4}], 'cv_dim': 32, 'rpn3d_dim': 32, 'downsampled_depth_offset': 0.5, 'use_stereo_out_type': 'feature', 'num_hg': 1}, 'DENSE_HEAD_2D': {'NAME': 'MMDet2DHead', 'use_3d_center': True, 'cfg': {'type': 'ATSSAdvHead', 'reg_class_agnostic': False, 'seperate_extra_reg_branch': False, 'num_classes': 3, 'in_channels': 64, 'stacked_convs': 4, 'feat_channels': 64, 'anchor_generator': {'type': 'AnchorGenerator', 'ratios': [1.0], 'octave_base_scale': 16, 'scales_per_octave': 1, 'strides': [4, 8, 16, 32, 64]}, 'num_extra_reg_channel': 0, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.1, 0.1, 0.2, 0.2]}, 'loss_cls': {'type': 'FocalLoss', 'use_sigmoid': True, 'gamma': 2.0, 'alpha': 0.25, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'GIoULoss', 'loss_weight': 2.0}, 'loss_centerness': {'type': 'CrossEntropyLoss', 'use_sigmoid': True, 'loss_weight': 1.0}, 'train_cfg': {'assigner': {'type': 'ATSS3DCenterAssigner', 'topk': 9}, 'allowed_border': -1, 'pos_weight': -1, 'append_3d_centers': True, 'debug': False}, 'test_cfg': {'nms_pre': 1000, 'min_bbox_size': 0, 'score_thr': 0.05, 'nms': {'type': 'nms', 'iou_threshold': 0.6}, 'max_per_img': 100}}}, 'MAP_TO_BEV': {'NAME': 'HeightCompression', 'NUM_BEV_FEATURES': 160, 'SPARSE_INPUT': False}, 'BACKBONE_2D': {'NAME': 'HgBEVBackbone', 'num_channels': 64, 'GN': True}, 'DENSE_HEAD': {'NAME': 'DetHead', 'NUM_CONVS': 2, 'GN': True, 'CLASS_AGNOSTIC': False, 'USE_DIRECTION_CLASSIFIER': True, 'DIR_OFFSET': 0.78539, 'DIR_LIMIT_OFFSET': 0.0, 'NUM_DIR_BINS': 2, 'CLAMP_VALUE': 10.0, 'xyz_for_angles': True, 'hwl_for_angles': True, 'do_feature_imitation': True, 'imitation_cfg': [{'lidar_feature_layer': 'spatial_features_2d', 'stereo_feature_layer': 'spatial_features_2d', 'normalize': 'cw_scale', 'layer': 'conv2d', 'channel': 64, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}, {'lidar_feature_layer': 'volume_features', 'stereo_feature_layer': 'volume_features', 'normalize': 'cw_scale', 'layer': 'conv3d', 'channel': 32, 'ksize': 1, 'use_relu': False, 'mode': 'inbox'}], 'ANCHOR_GENERATOR_CONFIG': [{'class_name': 'Car', 'anchor_sizes': [[3.9, 1.6, 1.56]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.78], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.6, 'unmatched_threshold': 0.45}, 
{'class_name': 'Pedestrian', 'anchor_sizes': [[0.8, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'Cyclist', 'anchor_sizes': [[1.76, 0.6, 1.73]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 1, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}], 'TARGET_ASSIGNER_CONFIG': {'NAME': 'AxisAlignedTargetAssigner', 'POS_FRACTION': -1.0, 'SAMPLE_SIZE': 512, 'NORM_BY_NUM_EXAMPLES': False, 'MATCH_HEIGHT': False, 'BOX_CODER': 'ResidualCoder', 'BOX_CODER_CONFIG': {'div_by_diagonal': True, 'use_corners': False, 'use_tanh': False}}, 'LOSS_CONFIG': {'REG_LOSS_TYPE': 'WeightedSmoothL1Loss', 'IOU_LOSS_TYPE': 'IOU3dLoss', 'IMITATION_LOSS_TYPE': 'WeightedL2WithSigmaLoss', 'LOSS_WEIGHTS': {'cls_weight': 1.0, 'loc_weight': 0.5, 'dir_weight': 0.2, 'iou_weight': 1.0, 'imitation_weight': 1.0, 'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}}, 'DEPTH_LOSS_HEAD': {'LOSS_TYPE': {'ce': 1.0}, 'WEIGHTS': [1.0]}, 'POST_PROCESSING': {'RECALL_THRESH_LIST': [0.3, 0.5, 0.7], 'SCORE_THRESH': 0.1, 'OUTPUT_RAW_SCORE': False, 'EVAL_METRIC': 'kitti', 'NMS_CONFIG': {'MULTI_CLASSES_NMS': True, 'NMS_TYPE': 'nms_gpu', 'NMS_THRESH': 0.25, 'NMS_PRE_MAXSIZE': 4096, 'NMS_POST_MAXSIZE': 500}}}
**
2022-03-24 22:10:58,884 INFO ==> Loading parameters from checkpoint ./ckpt/second_s4_hg.iouloss.ep78.backbone-no-final-bnrelu.input-only-xyz.default-lr-policy-with-wd-decay-78ep.pth to CPU
2022-03-24 22:10:58,897 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py60b444b
2022-03-24 22:10:58,948 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.0.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_cls_convs.1.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.0.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.0.weight: torch.Size([64, 64, 3, 3])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.weight: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.bias: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.running_mean: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.running_var: torch.Size([64])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.rpn3d_bbox_convs.1.0.1.num_batches_tracked: torch.Size([])
2022-03-24 22:10:58,949 INFO Not Loaded weight dense_head.conv_cls.weight: torch.Size([18, 64, 3, 3])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_cls.bias: torch.Size([18])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_box.weight: torch.Size([42, 64, 3, 3])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_box.bias: torch.Size([42])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_dir_cls.weight: torch.Size([12, 64, 1, 1])
2022-03-24 22:10:58,950 INFO Not Loaded weight dense_head.conv_dir_cls.bias: torch.Size([12])
2022-03-24 22:10:58,950 INFO ==> Done (loaded 110/110)
stereo volume depth range: 2.0 -> 59.599998474121094, interval 0.19999999470180935
2022-03-24 22:10:59,105 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.0.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.0.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.0.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.1.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.1.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.1.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.2.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.2.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.2.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.3.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.3.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.3.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.4.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.4.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.4.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.5.bn1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn1.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.5.bn2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer3.5.bn2.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.0.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.0.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.0.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.1.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.1.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.1.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.2.bn1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer4.2.bn2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn2.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer4.2.bn2.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
unexpected key in source state_dict: fc.weight, fc.bias, layer3.0.downsample.0.weight, layer3.0.downsample.1.running_mean, layer3.0.downsample.1.running_var, layer3.0.downsample.1.weight, layer3.0.downsample.1.bias, layer4.0.downsample.0.weight, layer4.0.downsample.1.running_mean, layer4.0.downsample.1.running_var, layer4.0.downsample.1.weight, layer4.0.downsample.1.bias
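
The size mismatches and unexpected keys above come from loading what looks like a torchvision-style ImageNet classifier checkpoint (hence the stray fc.weight / fc.bias and downsample keys) into a backbone whose layer3/layer4 widths differ. Because the state dict is loaded non-strictly, the mismatched parameters are skipped and reported as a warning rather than an error, and those layers keep their fresh initialization. As a generic illustration of what a non-strict load effectively does (this helper is hypothetical and not part of the LIGA repo), one can shape-filter a checkpoint before loading and see exactly which parameters were dropped:

import torch

def load_matching_weights(model, ckpt_path):
    # Hypothetical helper: copy only parameters whose name AND shape match
    # the current model, then report everything that was left untouched.
    state = torch.load(ckpt_path, map_location="cpu")
    # Checkpoints may nest weights under different keys; adjust as needed.
    state = state.get("model_state", state.get("state_dict", state))
    model_state = model.state_dict()
    filtered = {k: v for k, v in state.items()
                if k in model_state and v.shape == model_state[k].shape}
    model.load_state_dict(filtered, strict=False)
    # Anything returned here keeps its random initialization.
    return [k for k in model_state if k not in filtered]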

2022-03-24 22:10:59,122 INFO ** Model create finished **
2022-03-24 22:10:59,123 INFO ** Load checkpoint **
2022-03-24 22:10:59,123 INFO ==> Loading parameters from checkpoint ./ckpt/released.final.liga.3d-and-bev.ep53.pth to CPU
2022-03-24 22:10:59,157 INFO ==> Checkpoint trained from version: liga+0.1.0+7aa7b92+py72af526
2022-03-24 22:11:00,163 INFO ==> Done (loaded 484/484)
2022-03-24 22:11:00,182 INFO ** Start evaluation **
2022-03-24 22:11:00,182 INFO * EPOCH 53 EVALUATION ***
eval: 0%| | 0/3769 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/qingwu/anaconda3/envs/liga_cuda111/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/qingwu/anaconda3/envs/liga_cuda111/bin/python', '-u', 'tools/test.py', '--local_rank=0', '--launcher', 'pytorch', '--save_to_file', '--cfg_file', './configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--ckpt', './ckpt/released.final.liga.3d-and-bev.ep53.pth']' died with <Signals.SIGSEGV: 11>.

Any ideas? Thanks in advance.
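
One thing worth noting about the traceback above: it comes from torch.distributed.launch, which only reports that the worker process died with SIGSEGV. The crash itself happens inside native code (typically a compiled extension such as spconv, mmcv's CUDA ops, or the repo's own ops), so no Python traceback from the worker is shown. A hedged suggestion, not a feature of this repo: enabling Python's built-in faulthandler in the entry script (or launching the worker with python -X faulthandler) makes the interpreter dump the Python frames that were active when the segfault happened, which usually narrows the crash down to a specific op call.

# Hypothetical debugging snippet: place at the very top of tools/train.py or
# tools/test.py (it is not part of the LIGA code base).
import faulthandler
import sys

faulthandler.enable(file=sys.stderr, all_threads=True)
# From here on, a SIGSEGV inside a C/CUDA extension will print the Python
# stack that was executing at crash time, e.g. pointing at an spconv forward
# or a rotated-NMS call.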

Hi, have you solved this problem? I am seeing the same error messages.

SibylGao commented 2 years ago

Same problem here: nvcc 10.1, nvidia-smi reporting CUDA 10.2, PyTorch 1.6.0 + cudatoolkit 10.1, mmcv-full 1.2.1, mmdet 2.6.0, and Tesla V100 cards. I've worked on it for a few days and still cannot solve this problem :(
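
A segfault this early is frequently an ABI mismatch between the CUDA/PyTorch combination the compiled extensions (spconv, mmcv-full, the repo's own ops) were built against and the one actually imported at runtime. Before digging deeper, it may be worth printing both sides; this is a generic sanity check, not LIGA-specific code (python -m torch.utils.collect_env gives a fuller report):

# Hypothetical version/ABI sanity check.
import torch
import mmcv
import mmdet

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)  # should match the toolkit the extensions were built with
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("mmcv-full:", mmcv.__version__, "| mmdet:", mmdet.__version__)

# mmcv-full also records what it was compiled with (available in recent releases):
from mmcv.ops import get_compiler_version, get_compiling_cuda_version
print("mmcv built by:", get_compiler_version(), "with CUDA", get_compiling_cuda_version())

If these disagree (for example, spconv or mmcv-full built against a different CUDA toolkit than the one PyTorch reports), rebuilding the extensions against the installed PyTorch/CUDA combination is usually the first thing to try.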