xingyizhou / CenterNet2

Two-stage CenterNet
Apache License 2.0

Error: AssertionError: Box tensor contains infinite or NaN! #42

Closed Lg955 closed 3 years ago

Lg955 commented 3 years ago

📚 Documentation

I used the COCO dataset to train CenterNet2, but got an error:

Traceback (most recent call last):
  File "./train_net.py", line 245, in <module>
    args=(args,),
  File "/dataset/datacode/code/CenterNet2/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./train_net.py", line 226, in main
    do_train(cfg, model, resume=args.resume)
  File "./train_net.py", line 128, in do_train
    loss_dict = model(data)
  File "/home/das/anaconda3/envs/centernet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dataset/datacode/code/CenterNet2/detectron2/modeling/meta_arch/rcnn.py", line 166, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/das/anaconda3/envs/centernet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dataset/datacode/code/CenterNet2/projects/CenterNet2/centernet/modeling/roi_heads/custom_roi_heads.py", line 166, in forward
    losses = self._forward_box(features, proposals, targets)
  File "/dataset/datacode/code/CenterNet2/projects/CenterNet2/centernet/modeling/roi_heads/custom_roi_heads.py", line 116, in _forward_box
    proposals = self._create_proposals_from_boxes(prev_pred_boxes, image_sizes)
  File "/dataset/datacode/code/CenterNet2/detectron2/modeling/roi_heads/cascade_rcnn.py", line 290, in _create_proposals_from_boxes
    boxes_per_image.clip(image_size)
  File "/dataset/datacode/code/CenterNet2/detectron2/structures/boxes.py", line 200, in clip
    assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!"
AssertionError: Box tensor contains infinite or NaN!

What is the main cause of this? I think the COCO dataset is OK.
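The assertion in the traceback fires when the predicted box coordinates are no longer finite. One way to narrow down when that starts is to check the loss values at each iteration before the boxes are used. The following is only a minimal sketch: the helper name `find_non_finite` and the loss names in the usage example are hypothetical, and it assumes each loss value can be converted with `float()` (as a plain number or a scalar tensor can).

```python
import math


def find_non_finite(loss_dict):
    """Return the names of losses whose value is NaN or infinite.

    Hypothetical helper: `loss_dict` is assumed to map loss names to
    values that float() can convert (plain numbers or scalar tensors).
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Usage with made-up loss names and values.
losses = {"loss_a": 0.7, "loss_b": float("nan"), "loss_c": float("inf")}
bad = find_non_finite(losses)
if bad:
    print("non-finite losses:", bad)
```

Logging the iteration number together with the offending loss names would show whether the problem appears in the first iterations or later in training.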

xingyizhou commented 3 years ago

Hi, are you using the default training script in the model zoo? If so, please specify which model you are using and I can take a look. Otherwise, this means the training diverged. If this happens in the first few iterations (e.g., before iteration 1000), you can try increasing the number of warmup iterations. Otherwise, you can consider decreasing the learning rate, or changing the normalization layers in the backbone to "SyncBN".
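To illustrate why a longer warmup can help against early divergence, here is a small sketch of a linear warmup schedule. It is an assumption-laden illustration, not the training script's actual scheduler code: the function names, the starting factor of 0.001, and the numbers in the usage example are all hypothetical. In Detectron2 configs, the warmup length and base learning rate are normally set through `SOLVER.WARMUP_ITERS` and `SOLVER.BASE_LR`.

```python
def warmup_factor(iteration, warmup_iters, start_factor=0.001):
    """Linear warmup multiplier that ramps from start_factor up to 1.0.

    Illustrative only; start_factor=0.001 is an assumed default.
    """
    if iteration >= warmup_iters:
        return 1.0
    alpha = iteration / warmup_iters
    return start_factor * (1 - alpha) + alpha


def lr_at(iteration, base_lr, warmup_iters):
    """Effective learning rate at a given iteration during warmup."""
    return base_lr * warmup_factor(iteration, warmup_iters)


# Hypothetical numbers: same base LR, two different warmup lengths.
short = lr_at(500, base_lr=0.04, warmup_iters=1000)
long = lr_at(500, base_lr=0.04, warmup_iters=4000)
print(short > long)  # the longer warmup gives a smaller LR at iteration 500
```

With the longer warmup, the learning rate at the same early iteration is smaller, which gives the randomly initialized heads more time to settle before full-size updates are applied.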

Lg955 commented 3 years ago

I used the CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST.yaml config along with the default training script in the model zoo. I think the main error came from Detectron2; recompiling it resolved the error. Thank you!