xingyizhou / CenterNet2

Two-stage CenterNet
Apache License 2.0

Error: AssertionError: Box tensor contains infinite or NaN! #65

Open pSGAme opened 3 years ago

pSGAme commented 3 years ago

📚 Documentation

I used my custom data to train CenterNet2; however, I hit this error after about 3 hours of training:

[08/09 22:12:04 d2.utils.events]: eta: 1 day, 11:40:59  iter: 3660  total_loss: 1.185  loss_cls_stage0: 0.1144  loss_box_reg_stage0: 0.05557  loss_cls_stage1: 0.06713  loss_box_reg_stage1: 0.02371  loss_cls_stage2: 0.04383  loss_box_reg_stage2: 0.007958  loss_centernet_loc: 0.5499  loss_centernet_agn_pos: 0.3275  loss_centernet_agn_neg: 0.04878  time: 1.5286  data_time: 0.9600  lr: 0.018295  max_mem: 15129M
[08/09 22:12:47 d2.utils.events]: eta: 1 day, 11:42:35  iter: 3680  total_loss: 1.172  loss_cls_stage0: 0.09907  loss_box_reg_stage0: 0.05197  loss_cls_stage1: 0.06327  loss_box_reg_stage1: 0.02357  loss_cls_stage2: 0.0416  loss_box_reg_stage2: 0.008138  loss_centernet_loc: 0.538  loss_centernet_agn_pos: 0.3417  loss_centernet_agn_neg: 0.0396  time: 1.5287  data_time: 0.5869  lr: 0.018395  max_mem: 15129M
[08/09 22:13:34 d2.utils.events]: eta: 1 day, 11:41:24  iter: 3700  total_loss: 1.216  loss_cls_stage0: 0.1155  loss_box_reg_stage0: 0.05366  loss_cls_stage1: 0.06624  loss_box_reg_stage1: 0.02732  loss_cls_stage2: 0.04616  loss_box_reg_stage2: 0.01099  loss_centernet_loc: 0.5492  loss_centernet_agn_pos: 0.326  loss_centernet_agn_neg: 0.04868  time: 1.5288  data_time: 0.8384  lr: 0.018495  max_mem: 15129M
[08/09 22:14:15 d2.utils.events]: eta: 1 day, 11:40:55  iter: 3720  total_loss: 1.221  loss_cls_stage0: 0.113  loss_box_reg_stage0: 0.05246  loss_cls_stage1: 0.06204  loss_box_reg_stage1: 0.02323  loss_cls_stage2: 0.03913  loss_box_reg_stage2: 0.007532  loss_centernet_loc: 0.5236  loss_centernet_agn_pos: 0.3329  loss_centernet_agn_neg: 0.03891  time: 1.5287  data_time: 0.5345  lr: 0.018595  max_mem: 15129M
No instances! torch.Size([0, 7]) torch.Size([0, 4]) 16
No instance in box reg loss
No instances! torch.Size([0, 7]) torch.Size([0, 4]) 16
No instance in box reg loss
Traceback (most recent call last):
  File "train_net.py", line 323, in <module>
    args=(args,),
  File "/cache/user-job-dir/codes/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 303, in main
    do_train(cfg, model, resume=args.resume)
  File "train_net.py", line 207, in do_train
    loss_dict = model(data)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cache/user-job-dir/codes/detectron2/modeling/meta_arch/rcnn.py", line 163, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cache/user-job-dir/codes/centernet/modeling/roi_heads/custom_roi_heads.py", line 166, in forward
    losses = self._forward_box(features, proposals, targets)
  File "/cache/user-job-dir/codes/centernet/modeling/roi_heads/custom_roi_heads.py", line 116, in _forward_box
    proposals = self._create_proposals_from_boxes(prev_pred_boxes, image_sizes)
  File "/cache/user-job-dir/codes/detectron2/modeling/roi_heads/cascade_rcnn.py", line 290, in _create_proposals_from_boxes
    boxes_per_image.clip(image_size)
  File "/cache/user-job-dir/codes/detectron2/structures/boxes.py", line 200, in clip
    assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!"
AssertionError: Box tensor contains infinite or NaN!

I have no idea why the training diverged, since the losses were still quite small.
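For reference, a minimal sketch (not part of this repo) of how one might catch this kind of divergence earlier: check the per-iteration loss dict for non-finite values before the `Boxes.clip` assertion fires. The loop variables `model`, `data`, and `iteration` mirror the `do_train` loop from `train_net.py` shown in the traceback, and the helper name `check_losses_finite` is made up for illustration.

```python
import math
import torch

def check_losses_finite(loss_dict, iteration):
    """Raise a readable error if any loss has gone to NaN/Inf."""
    for name, value in loss_dict.items():
        v = value.item() if torch.is_tensor(value) else float(value)
        if not math.isfinite(v):
            raise FloatingPointError(
                f"Loss '{name}' is {v} at iteration {iteration}; training has diverged."
            )

# Inside the training loop, right after the forward pass:
# loss_dict = model(data)
# check_losses_finite(loss_dict, iteration)
# losses = sum(loss_dict.values())
# losses.backward()
```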

xingyizhou commented 3 years ago

Hi, thank you for your interest, and sorry for my delayed response. This error means the training diverged. Common ways to avoid it are to increase the warmup iterations (SOLVER.WARMUP_ITERS) or to decrease the learning rate.
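As a concrete starting point, here is a hedged sketch of that suggestion using detectron2's standard solver keys (`SOLVER.WARMUP_ITERS`, `SOLVER.WARMUP_FACTOR`, `SOLVER.BASE_LR`). The `add_centernet_config` import follows this repo's training script, the config file path is only an example, and the exact values are guesses to tune for your own dataset.

```python
from detectron2.config import get_cfg
from centernet.config import add_centernet_config  # CenterNet2's config extension

cfg = get_cfg()
add_centernet_config(cfg)
cfg.merge_from_file("configs/CenterNet2_R50_1x.yaml")  # example config; substitute your own

# Longer warmup: ramp the learning rate up over more iterations so early
# updates are gentler and predicted boxes are less likely to blow up to NaN/Inf.
cfg.SOLVER.WARMUP_ITERS = 4000          # e.g. 4x detectron2's default of 1000
cfg.SOLVER.WARMUP_FACTOR = 1.0 / 4000   # keep the initial LR proportionally small

# And/or simply lower the base learning rate.
cfg.SOLVER.BASE_LR = 0.01               # the log above shows lr around 0.018
```

If `train_net.py` uses detectron2's default argument parser, the same values can also be overridden on the command line after the config file, e.g. `SOLVER.WARMUP_ITERS 4000 SOLVER.BASE_LR 0.01`.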