zju3dv / disprcnn

Code release for Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation (CVPR 2020, TPAMI 2021)
Apache License 2.0

Running sh scripts/car/vob/train_smrcnn.sh hangs #50

Closed: JimRenault closed this issue 1 year ago

JimRenault commented 1 year ago

I ran sh scripts/car/vob/train_smrcnn.sh with nohup, using the configs/kitti/car/vob/mask yaml configuration file with MAX_ITER: 4140. The output is as follows (excerpt from the end of the nohup.out file):

2023-04-20 07:59:27,417 disprcnn.trainer INFO: eta: 0:00:38 iter: 4098 valid_iter: 4023 loss: 0.1156 (0.1763) loss_mask: 0.1156 (0.1763) time: 0.8900 (0.9200) data: 0.0454 (0.0493) lr: 0.00003952 max mem: 4420
2023-04-20 07:59:28,285 disprcnn.trainer INFO: eta: 0:00:37 iter: 4099 valid_iter: 4024 loss: 0.1156 (0.1763) loss_mask: 0.1156 (0.1763) time: 0.8900 (0.9200) data: 0.0454 (0.0493) lr: 0.00003884 max mem: 4420
2023-04-20 07:59:29,133 disprcnn.trainer INFO: eta: 0:00:36 iter: 4100 valid_iter: 4025 loss: 0.1156 (0.1763) loss_mask: 0.1156 (0.1763) time: 0.8900 (0.9199) data: 0.0455 (0.0493) lr: 0.00003817 max mem: 4420
2023-04-20 07:59:29,975 disprcnn.trainer INFO: eta: 0:00:35 iter: 4101 valid_iter: 4026 loss: 0.1151 (0.1762) loss_mask: 0.1151 (0.1762) time: 0.8900 (0.9199) data: 0.0455 (0.0493) lr: 0.00003751 max mem: 4420
2023-04-20 07:59:30,852 disprcnn.trainer INFO: eta: 0:00:34 iter: 4102 valid_iter: 4027 loss: 0.1150 (0.1762) loss_mask: 0.1150 (0.1762) time: 0.8785 (0.9199) data: 0.0457 (0.0493) lr: 0.00003685 max mem: 4420
2023-04-20 07:59:31,726 disprcnn.trainer INFO: eta: 0:00:34 iter: 4103 valid_iter: 4028 loss: 0.1150 (0.1762) loss_mask: 0.1150 (0.1762) time: 0.8733 (0.9199) data: 0.0457 (0.0493) lr: 0.00003619 max mem: 4420
2023-04-20 07:59:32,555 disprcnn.trainer INFO: eta: 0:00:33 iter: 4104 valid_iter: 4029 loss: 0.1150 (0.1762) loss_mask: 0.1150 (0.1762) time: 0.8682 (0.9199) data: 0.0455 (0.0492) lr: 0.00003555 max mem: 4420

JimRenault commented 1 year ago

It gets stuck at the same place every time I run it, at iter: 4104. Has anyone run into this? Any help would be greatly appreciated!
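
To confirm where a run stopped without scrolling through nohup.out, the last logged iteration can be pulled out of the log text. This is only a minimal sketch, assuming the log lines keep the "iter: N" field seen in the disprcnn.trainer output in this thread; the function name last_logged_iter is made up for illustration and is not part of the repository.

```python
import re

# Matches the "iter: N" field of a trainer log line. The \b keeps the
# pattern from also matching the "valid_iter: N" field, because "_" is a
# word character and so there is no word boundary inside "valid_iter".
ITER_RE = re.compile(r"\biter: (\d+)")


def last_logged_iter(log_text):
    """Return the largest 'iter:' value found in log_text, or None."""
    iters = [int(m.group(1)) for m in ITER_RE.finditer(log_text)]
    return max(iters) if iters else None


# Shortened copies of two log lines from this thread.
sample = (
    "disprcnn.trainer INFO: eta: 0:00:34 iter: 4103 valid_iter: 4028\n"
    "disprcnn.trainer INFO: eta: 0:00:33 iter: 4104 valid_iter: 4029\n"
)
print(last_logged_iter(sample))  # prints 4104
```

In practice the text would come from reading nohup.out; comparing the result with MAX_ITER from the yaml file shows how many iterations were left when the run stalled.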

ootts commented 1 year ago

See here. Since the training is nearly complete, you can terminate the program and use the final checkpoint.
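
As a rough illustration of picking up the final checkpoint after terminating the run, the sketch below returns the checkpoint file with the highest iteration number in an output directory. The filename pattern model_<iteration>.pth and the helper name latest_checkpoint are assumptions made for this example, not something confirmed in this thread; check the actual filenames in your output directory and adjust the pattern to match.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir, pattern=r"model_(\d+)\.pth"):
    """Return the checkpoint path with the highest iteration number.

    The default pattern is an assumed naming scheme (model_<iteration>.pth).
    Returns None if no file in output_dir matches it.
    """
    best_path = None
    best_iter = -1
    for path in Path(output_dir).iterdir():
        match = re.fullmatch(pattern, path.name)
        if match is None:
            continue  # skip logs, configs, and differently named files
        iteration = int(match.group(1))
        if iteration > best_iter:
            best_path, best_iter = path, iteration
    return best_path
```

Calling latest_checkpoint on the training output directory gives the path to load for inference or for the next training stage, provided the files follow the assumed naming scheme.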