xingyizhou / CenterNet2

Two-stage CenterNet
Apache License 2.0

Can not reproduce the effect #53

Open liuheng0111 opened 3 years ago

liuheng0111 commented 3 years ago

I trained the CenterNet2_R50_1x model on 8 V100 GPUs, but the best AP I get is 40.26, lower than the reported 42.9. Can you give me some suggestions? I used the following config:

```yaml
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.02
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  CHECKPOINT_PERIOD: 1000000000
  WARMUP_ITERS: 4000
  WARMUP_FACTOR: 0.00025
  CLIP_GRADIENTS:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
```

xingyizhou commented 3 years ago

Hi, thank you for your interest. Can you show the full config, including the _BASE_? If you are using the exact CenterNet2_R50_1x config, you don't need to copy these values, as they are already contained in the base file. If your _BASE_ is already Base-CenterNet2.yaml, can you also provide your training command?
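For readers following along, here is a minimal sketch of how detectron2 resolves the _BASE_ key, so a derived config only needs the values that differ from Base-CenterNet2.yaml. The `add_centernet_config` helper name is taken from this repo's train_net.py; treat the exact import as an assumption:

```python
from detectron2.config import get_cfg
from centernet.config import add_centernet_config  # assumption: config helper shipped with CenterNet2

cfg = get_cfg()
add_centernet_config(cfg)
# merge_from_file follows the _BASE_ key recursively, so CenterNet2_R50_1x.yaml
# inherits the solver/dataset settings from Base-CenterNet2.yaml automatically.
cfg.merge_from_file("configs/CenterNet2_R50_1x.yaml")
print(cfg.SOLVER.IMS_PER_BATCH, cfg.SOLVER.BASE_LR, cfg.SOLVER.MAX_ITER)
```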

Best, Xingyi

liuheng0111 commented 3 years ago

Yes, my _BASE_ is already Base-CenterNet2.yaml. The training command is: python train_net.py --num-gpus 8 --config-file configs/CenterNet2_R50_1x.yaml

xingyizhou commented 3 years ago

Hi, I have shared my training and evaluation log for the R50-1x model here. Please check whether anything is clearly mismatched. If that doesn't help, I am happy to look at your training log if you can upload it.

liuheng0111 commented 3 years ago

My training log is very different from yours. My losses look like:

total_loss: 1.169  loss_box_reg_stage0: 0.1548  loss_box_reg_stage1: 0.2466  loss_box_reg_stage2: 0.2614  loss_centernet_agn_neg: 0.02222  loss_centernet_agn_pos: 0.05661  loss_centernet_loc: 0.1677  loss_cls_stage0: 0.09623  loss_cls_stage1: 0.08068

but yours are:

stage0/fast_rcnn/cls_accuracy: 0.944  stage0/fast_rcnn/fg_cls_accuracy: 0.755  stage1/fast_rcnn/cls_accuracy: 0.949  stage1/fast_rcnn/fg_cls_accuracy: 0.780  stage2/fast_rcnn/cls_accuracy: 0.955  stage2/fast_rcnn/fg_cls_accuracy: 0.773  total_loss: 1.401  loss_box_reg_stage0: 0.177  loss_box_reg_stage1: 0.260  loss_box_reg_stage2: 0.262  loss_centernet_agn_neg: 0.033  loss_centernet_agn_pos: 0.087  loss_centernet_loc: 0.198  loss_cls_stage0: 0.144  loss_cls_stage1: 0.126  loss_cls_stage2: 0.115  time: 0.5513  data_time: 0.0178  lr: 0.000200  max_mem: 5000M

Is it because a different version of detectron2 is used?

Your model's output config has: EFFICIENTNET: BASE_NAME: efficientnet_b3, NORM: FrozenBN, OUT_LEVELS: (3, 4, 5). Why is EFFICIENTNET in the config? Is your Base-CenterNet2.yaml this one?

xingyizhou commented 3 years ago

Hi, sorry for the delayed response. It seems I do not have access to your log; can you share it? Regarding the difference in the log format: yes, I used my own detectron2 version, in which I print more statistics during training (simply by modifying this line). EfficientNet is in my own version and is not used in this project. These should not affect the functionality of the codebase. The referenced Base-CenterNet2.yaml is correct. I'll have a better idea once I see your log.
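For context, a minimal sketch of how such extra statistics could be logged from a ROI-heads forward pass using detectron2's event storage. The metric names mirror the ones in the log above, but the surrounding function is hypothetical, not the author's actual modification:

```python
import torch
from detectron2.utils.events import get_event_storage

def log_cls_stats(pred_logits: torch.Tensor, gt_classes: torch.Tensor, stage: int, num_classes: int):
    """Log overall and foreground classification accuracy for one cascade stage."""
    pred_classes = pred_logits.argmax(dim=1)
    fg_mask = gt_classes < num_classes  # in detectron2, background proposals carry the index num_classes

    storage = get_event_storage()
    storage.put_scalar(f"stage{stage}/fast_rcnn/cls_accuracy",
                       (pred_classes == gt_classes).float().mean().item())
    if fg_mask.any():
        storage.put_scalar(f"stage{stage}/fast_rcnn/fg_cls_accuracy",
                           (pred_classes[fg_mask] == gt_classes[fg_mask]).float().mean().item())
```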

bywang2018 commented 3 years ago

Hello, where can I download this dataset (coco_un_yolov4_55_0.5)? Thanks! @xingyizhou

liuheng0111 commented 3 years ago

My training log.

Another question: I want to train a detector on my own dataset, where some boxes have a category label and some do not. I tried two approaches: (1) add an extra category for the unlabeled boxes; (2) do not add an extra category, set the category of unlabeled boxes to -999, and exclude those boxes from the classification loss. As a test I trained on COCO with the category masked out for half of the boxes, while the validation set keeps all categories. Both approaches reach about 30 mAP, but the loss grows even as the AP improves:

total_loss: 10.1  loss_box_reg_stage0: 2.176  loss_box_reg_stage1: 2.417  loss_box_reg_stage2: 4.492  loss_centernet_agn_neg: 0.03299  loss_centernet_agn_pos: 0.1173  loss_centernet_loc: 0.2441  loss_cls_stage0: 0.1453  loss_cls_stage1: 0.1776  loss_cls_stage2: 0.2449

The box regression loss is very large, yet the mAP on the validation set keeps getting better. Any suggestions? Attached is the training log of approach 2 on COCO.
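A minimal sketch of the masking idea in approach 2, assuming unlabeled boxes carry the sentinel category -999 and a plain cross-entropy classification head; this is illustrative only, not the code used in the thread:

```python
import torch
import torch.nn.functional as F

IGNORE_CATEGORY = -999  # sentinel for boxes that have no category label (assumption from the post above)

def masked_cls_loss(cls_logits: torch.Tensor, gt_classes: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over labeled boxes only; unlabeled boxes contribute nothing."""
    labeled = gt_classes != IGNORE_CATEGORY
    if not labeled.any():
        return cls_logits.sum() * 0.0  # keeps the graph connected when every box is unlabeled
    return F.cross_entropy(cls_logits[labeled], gt_classes[labeled], reduction="mean")
```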

xingyizhou commented 3 years ago

The log shows you are using a batch size of 96. Can you use the original batch size (16), or modify the total iterations and learning rate accordingly? Please do specify any changes you made to the code when reporting reproducibility issues.
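For reference, a quick sketch of the linear scaling rule that detectron2-style schedules are usually adjusted with when the batch size changes. The reference values come from the config posted above; the scaled numbers are an illustration, not a verified recipe for matching 42.9 AP:

```python
# Reference 1x schedule from the config above: batch 16, lr 0.02, 90k iters.
ref_batch, ref_lr, ref_max_iter, ref_steps = 16, 0.02, 90000, (60000, 80000)
new_batch = 96  # batch size seen in the uploaded log

scale = new_batch / ref_batch                            # 6.0
new_lr = ref_lr * scale                                  # 0.12
new_max_iter = round(ref_max_iter / scale)               # 15000
new_steps = tuple(round(s / scale) for s in ref_steps)   # (10000, 13333)

print(new_lr, new_max_iter, new_steps)
```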

xingyizhou commented 3 years ago

@WangBoying you can download it here from the model zoo.

bywang2018 commented 3 years ago

This is very important to me! Thank you very much!

@xingyizhou

liuheng0111 commented 3 years ago

@xingyizhou Here is the training log from the original code and config.

LeonLab commented 2 years ago

@liuheng0111 Did you find a solution? I have a similar problem.