yzd-v / FGD

Focal and Global Knowledge Distillation for Detectors (CVPR 2022)
Apache License 2.0

Drop in mAP after Retinanet Distillation #91

Open Rrschch-6 opened 1 year ago

Rrschch-6 commented 1 year ago

I am using fgd_retina_r101_fpn_2x_distill_retina_r50_fpn_2x_coco.py to distill RetinaNet-R101 (trained on my own COCO-formatted dataset, mAP 0.75) into RetinaNet-R50, but the student's mAP drops significantly, to 0.58. I am using SGD with the same parameters as in the original config. Training plateaus with loss_cls around 0.25 after epoch 20:

"epoch": 20, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.03356, "loss_cls": 0.25744, "loss_bbox": 0.19824, "loss_fgd_fpn_4": 2.68374, "loss_fgd_fpn_3": 2.18955, "loss_fgd_fpn_2": 0.43481, "loss_fgd_fpn_1": 1.17425, "loss_fgd_fpn_0": 4.25011, "loss": 11.18815, "grad_norm": 53.40476, "time": 0.65314 "bbox_mAP": 0.561

"epoch": 21, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.02153, "loss_cls": 0.26974, "loss_bbox": 0.21168, "loss_fgd_fpn_4": 2.03119, "loss_fgd_fpn_3": 1.96431, "loss_fgd_fpn_2": 0.43961, "loss_fgd_fpn_1": 1.17302, "loss_fgd_fpn_0": 4.24643, "loss": 10.33597, "grad_norm": 43.16615, "time": 0.60769} "bbox_mAP": 0.571,

"epoch": 22, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.02091, "loss_cls": 0.25263, "loss_bbox": 0.19929, "loss_fgd_fpn_4": 1.99079, "loss_fgd_fpn_3": 1.83311, "loss_fgd_fpn_2": 0.42802, "loss_fgd_fpn_1": 1.14707, "loss_fgd_fpn_0": 4.15834, "loss": 10.00925, "grad_norm": 46.78779, "time": 0.62421 bbox_mAP": 0.578,

"epoch": 23, "iter": 200, "lr": 0.0001, "memory": 5598, "data_time": 0.01811, "loss_cls": 0.25804, "loss_bbox": 0.20361, "loss_fgd_fpn_4": 1.42493, "loss_fgd_fpn_3": 1.68207, "loss_fgd_fpn_2": 0.41495, "loss_fgd_fpn_1": 1.12626, "loss_fgd_fpn_0": 4.09925, "loss": 9.20911, "grad_norm": 20.09985, "time": 0.60161 "bbox_mAP": 0.577,

"epoch": 24, "iter": 200, "lr": 0.0001, "memory": 5598, "data_time": 0.01983, "loss_cls": 0.25139, "loss_bbox": 0.1968, "loss_fgd_fpn_4": 1.32875, "loss_fgd_fpn_3": 1.57256, "loss_fgd_fpn_2": 0.41188, "loss_fgd_fpn_1": 1.11039, "loss_fgd_fpn_0": 4.06505, "loss": 8.93682, "grad_norm": 19.73, "time": 0.62296} "bbox_mAP": 0.581,

Is this happening because I am distilling using only my fine-tuning data? If not, what could the problem be?
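For reference, the way I plug my own COCO-formatted data into the distillation config is roughly this (paths, class names, and batch size below are placeholders, not my real ones):

```python
# Sketch of my data override on top of the FGD config; everything below
# (paths, class names, batch size) is a placeholder.
_base_ = ['./fgd_retina_r101_fpn_2x_distill_retina_r50_fpn_2x_coco.py']

classes = ('damage',)
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        classes=classes,
        ann_file='data/my_dataset/annotations/train.json',
        img_prefix='data/my_dataset/train/'),
    val=dict(
        classes=classes,
        ann_file='data/my_dataset/annotations/val.json',
        img_prefix='data/my_dataset/val/'),
    test=dict(
        classes=classes,
        ann_file='data/my_dataset/annotations/test.json',
        img_prefix='data/my_dataset/test/'))
```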

yzd-v commented 1 year ago

It seems strange. What are the teacher's and student's performances before distillation? The teacher is at 75%; how about the student?

Rrschch-6 commented 1 year ago

My student's mAP is 73% on my test dataset.

Let me describe my workflow:

1. I trained RetinaNet-R101 on my data (my use case is damage detection on equipment), starting from a pretrained ResNet-101 backbone. As I said, its mAP is 0.75.
2. I trained RetinaNet-R50 on the same data, starting from a pretrained ResNet-50 backbone. Its mAP is 0.73. This is my baseline.
3. I distilled using the checkpoint of RetinaNet-R101 as the teacher for RetinaNet-R50. The mAP drops to 58%.

Note: I put my data dict in the distiller config.

Would initializing the student backbone with the already-trained RetinaNet-R50 backbone from step 2 help?

yzd-v commented 1 year ago

For distillation, you should keep the same training setting as in step 2. For example, start from the pretrained ResNet-50, then train the student with FGD. Besides, you can use the inheriting strategy to further improve the student.
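In config terms, keeping the student's backbone initialization the same as the baseline could look roughly like this (illustrative only; the exact key depends on the mmdetection version the config is based on):

```python
# Illustrative student-model override in the distillation config.
# Older mmdetection configs use the top-level `pretrained` key:
model = dict(
    pretrained='torchvision://resnet50',   # same ImageNet init as the baseline
    backbone=dict(depth=50))

# Newer mmdetection configs express the same thing via `init_cfg`:
# model = dict(
#     backbone=dict(
#         depth=50,
#         init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')))
```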

Rrschch-6 commented 1 year ago

Thanks for the reply.

1. Would you please explain more about keeping the distillation setting the same as in step 2?
2. I am already using the inheriting strategy to initialize the student's neck and head with the teacher's.

yzd-v commented 1 year ago

The training and initialization settings for the baseline and for distillation should be the same, such as using the pretrained ResNet-50. Normally, the performance after the first epoch is already much higher than the baseline's.

Rrschch-6 commented 1 year ago

1. Here the initialization of the backbone is skipped:
   `if name.startswith("backbone."): continue`
2. My teacher and student were trained with Adam at lr=0.001. Do you think I should change the distiller configuration to use the same parameters?

yzd-v commented 1 year ago
  1. Not this; you do not need to skip it (see the sketch below for what the inheriting step actually copies).
  2. Keep them the same, including the optimizer.
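For reference, the inheriting step just copies every teacher weight except the backbone into the student before distillation starts, roughly like this (an illustrative sketch, not the exact code in this repo's training script):

```python
from collections import OrderedDict

import torch


def inherit_from_teacher(student, teacher_ckpt_path):
    """Load the teacher's neck and head weights into the student.

    Backbone weights are skipped (that is what the quoted
    `if name.startswith("backbone."): continue` does), because the
    student keeps its own ImageNet-pretrained ResNet-50 backbone.
    The function name and loading details here are illustrative.
    """
    ckpt = torch.load(teacher_ckpt_path, map_location='cpu')
    state_dict = ckpt.get('state_dict', ckpt)
    inherited = OrderedDict(
        (name, value) for name, value in state_dict.items()
        if not name.startswith('backbone.'))
    # strict=False because the backbone keys are intentionally missing.
    return student.load_state_dict(inherited, strict=False)
```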
Rrschch-6 commented 1 year ago

I have initialized the backbone of the student and set the optimizer the same as the baseline. Now I start at 71% after the first epoch, but in the following epochs the mAP drops significantly and the model does not seem to be converging:

2023-09-06 18:25:50,163 - mmdet - INFO - Epoch [1][50/203] lr: 9.890e-05, eta: 3:06:07, time: 2.316, data_time: 1.673, memory: 5745, loss_cls: 0.9247, loss_bbox: 0.4944, loss_fgd_fpn_4: 37.8894, loss_fgd_fpn_3: 7.7019, loss_fgd_fpn_2: 0.6114, loss_fgd_fpn_1: 2.2289, loss_fgd_fpn_0: 8.1574, loss: 58.0080, grad_norm: 571.0370
2023-09-06 18:26:23,640 - mmdet - INFO - Epoch [1][100/203] lr: 1.988e-04, eta: 1:58:45, time: 0.670, data_time: 0.069, memory: 5745, loss_cls: 0.1247, loss_bbox: 0.1389, loss_fgd_fpn_4: 2.6471, loss_fgd_fpn_3: 0.9848, loss_fgd_fpn_2: 0.2494, loss_fgd_fpn_1: 0.8901, loss_fgd_fpn_0: 3.3881, loss: 8.4231, grad_norm: 228.2081
2023-09-06 18:26:54,944 - mmdet - INFO - Epoch [1][150/203] lr: 2.987e-04, eta: 1:34:45, time: 0.626, data_time: 0.020, memory: 5745, loss_cls: 0.1039, loss_bbox: 0.1309, loss_fgd_fpn_4: 4.3547, loss_fgd_fpn_3: 0.9474, loss_fgd_fpn_2: 0.1836, loss_fgd_fpn_1: 0.6756, loss_fgd_fpn_0: 2.6093, loss: 9.0055, grad_norm: 350.5065
2023-09-06 18:27:25,823 - mmdet - INFO - Epoch [1][200/203] lr: 3.986e-04, eta: 1:22:20, time: 0.618, data_time: 0.019, memory: 5745, loss_cls: 0.1332, loss_bbox: 0.1357, loss_fgd_fpn_4: 6.1208, loss_fgd_fpn_3: 1.2110, loss_fgd_fpn_2: 0.1994, loss_fgd_fpn_1: 0.6954, loss_fgd_fpn_0: 2.6123, loss: 11.1077, grad_norm: 423.2370
bbox_mAP: 0.7140

2023-09-06 18:31:15,442 - mmdet - INFO - Epoch [2][50/203] lr: 5.045e-04, eta: 1:38:13, time: 2.226, data_time: 1.617, memory: 5745, loss_cls: 0.1705, loss_bbox: 0.1552, loss_fgd_fpn_4: 2.1541, loss_fgd_fpn_3: 0.8034, loss_fgd_fpn_2: 0.1834, loss_fgd_fpn_1: 0.6773, loss_fgd_fpn_0: 2.6396, loss: 6.7835, grad_norm: 174.3350
2023-09-06 18:31:47,209 - mmdet - INFO - Epoch [2][100/203] lr: 6.044e-04, eta: 1:29:06, time: 0.635, data_time: 0.020, memory: 5745, loss_cls: 0.1480, loss_bbox: 0.1469, loss_fgd_fpn_4: 2.1242, loss_fgd_fpn_3: 0.6617, loss_fgd_fpn_2: 0.1525, loss_fgd_fpn_1: 0.5729, loss_fgd_fpn_0: 2.2660, loss: 6.0722, grad_norm: 183.8174
2023-09-06 18:32:18,974 - mmdet - INFO - Epoch [2][150/203] lr: 7.043e-04, eta: 1:22:25, time: 0.635, data_time: 0.021, memory: 5745, loss_cls: 0.1637, loss_bbox: 0.1636, loss_fgd_fpn_4: 1.7635, loss_fgd_fpn_3: 0.6042, loss_fgd_fpn_2: 0.1576, loss_fgd_fpn_1: 0.5902, loss_fgd_fpn_0: 2.3537, loss: 5.7965, grad_norm: 154.4271
2023-09-06 18:32:50,011 - mmdet - INFO - Epoch [2][200/203] lr: 8.042e-04, eta: 1:17:08, time: 0.621, data_time: 0.029, memory: 5745, loss_cls: 0.1569, loss_bbox: 0.1697, loss_fgd_fpn_4: 3.1833, loss_fgd_fpn_3: 0.7770, loss_fgd_fpn_2: 0.1706, loss_fgd_fpn_1: 0.6309, loss_fgd_fpn_0: 2.3866, loss: 7.4749, grad_norm: 227.1264
bbox_mAP: 0.6870

2023-09-06 18:36:43,303 - mmdet - INFO - Epoch [3][50/203] lr: 9.101e-04, eta: 1:25:49, time: 2.288, data_time: 1.640, memory: 5745, loss_cls: 0.4277, loss_bbox: 0.2139, loss_fgd_fpn_4: 4.8695, loss_fgd_fpn_3: 1.1391, loss_fgd_fpn_2: 0.2268, loss_fgd_fpn_1: 0.8387, loss_fgd_fpn_0: 3.3569, loss: 11.0726, grad_norm: 293.6464
2023-09-06 18:37:14,756 - mmdet - INFO - Epoch [3][100/203] lr: 1.000e-03, eta: 1:20:59, time: 0.629, data_time: 0.019, memory: 5745, loss_cls: 0.4082, loss_bbox: 0.2028, loss_fgd_fpn_4: 2.1853, loss_fgd_fpn_3: 0.7505, loss_fgd_fpn_2: 0.2096, loss_fgd_fpn_1: 0.7672, loss_fgd_fpn_0: 3.3458, loss: 7.8694, grad_norm: 174.7761
2023-09-06 18:37:46,440 - mmdet - INFO - Epoch [3][150/203] lr: 1.000e-03, eta: 1:16:57, time: 0.634, data_time: 0.022, memory: 5745, loss_cls: 0.2216, loss_bbox: 0.1832, loss_fgd_fpn_4: 2.8286, loss_fgd_fpn_3: 0.7335, loss_fgd_fpn_2: 0.1637, loss_fgd_fpn_1: 0.5890, loss_fgd_fpn_0: 2.2849, loss: 7.0046, grad_norm: 201.6889
2023-09-06 18:38:18,894 - mmdet - INFO - Epoch [3][200/203] lr: 1.000e-03, eta: 1:13:36, time: 0.649, data_time: 0.017, memory: 5745, loss_cls: 0.4074, loss_bbox: 0.2027, loss_fgd_fpn_4: 2.2815, loss_fgd_fpn_3: 0.8024, loss_fgd_fpn_2: 0.1913, loss_fgd_fpn_1: 0.7121, loss_fgd_fpn_0: 2.5630, loss: 7.1605, grad_norm: 149.1443
bbox_mAP: 0.4750

yzd-v commented 1 year ago

It seems strange. Is everything else kept the same as the baseline, so that the first epoch performs best? Is the learning rate for distillation kept the same as the baseline's?

Rrschch-6 commented 1 year ago

Thanks. The problem was the learning rate. I reduced it and now the optimization is working. I will share the results under this post for reference.
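The change was essentially just the optimizer section of the distillation config (the values below are illustrative, not my exact final numbers):

```python
# Illustrative optimizer override: same Adam optimizer as my baseline,
# but with a smaller learning rate; the numbers are placeholders.
optimizer = dict(type='Adam', lr=1e-4)
# Optional gradient clipping to tame the large grad_norm spikes seen in the logs.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```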

My other question is about **loss_fgd_fpn_0 to loss_fgd_fpn_4**: is loss_fgd_fpn_0 the L_at loss and loss_fgd_fpn_4 the L_focal loss?