It seems strange. What are the teacher's and the student's performance before distillation? The teacher is at 75%; how about the student?
My student's mAP is 73% on my test dataset.
Let me describe my workflow:

1- I trained RetinaNet-R101 on my data (my use case is damage detection on equipment) using a pretrained ResNet-101. As I said, its mAP is 0.75.
2- Then I trained RetinaNet-R50 on the same data used in step 1, using a pretrained ResNet-50. Its mAP is 0.73, and I use this as the baseline.
3- I distill using the checkpoint of RetinaNet-R101 as the teacher for RetinaNet-R50. The mAP drops to 58%.

Note: I put my data dict in the distiller config (a sketch of what such an override looks like is below).
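For reference, a minimal sketch of what such a data override can look like in an MMDetection 2.x style distillation config; the dataset paths and class names below are placeholders, not the real ones from this use case:

```python
# Sketch only: pointing the distillation config at a custom COCO-format
# dataset (MMDetection 2.x conventions). All paths and class names are
# placeholders.
dataset_type = 'CocoDataset'
classes = ('dent', 'scratch')        # placeholder damage classes
data_root = 'data/damage/'           # placeholder data root

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'images/train/'),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'images/val/'),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/test.json',
        img_prefix=data_root + 'images/test/'))
```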
Would initializing the student's backbone with the already trained RetinaNet-R50 backbone (from step 2) help?
For distillation, you should keep the training setting the same as in step 2. For example, use the pretrained ResNet-50 first, then train the student with FGD. Besides, you can use the inheriting strategy to further improve the student.
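For reference, in the FGD configs the inheriting strategy is enabled via a flag in the distiller block; a minimal sketch, assuming the layout of the repo's own fgd_retina_* configs (the teacher checkpoint path is a placeholder, and distill_cfg is omitted for brevity):

```python
# Sketch only: distiller block of an FGD config with the inheriting
# strategy enabled. The checkpoint path is a placeholder for the
# RetinaNet-R101 teacher trained in step 1; distill_cfg stays as in the
# original config and is omitted here.
distiller = dict(
    type='DetectionDistiller',
    teacher_pretrained='work_dirs/retinanet_r101_damage/latest.pth',
    init_student=True)  # inherit the teacher's neck and head into the student
```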
Thanks for the reply. 1- Would you please explain more about keeping the distillation setting the same as in step 2? 2- I am using the inheriting strategy to initialize the student's neck and head from the teacher.
The training and initialization settings for the baseline and for distillation should be the same, such as using the pretrained ResNet-50. Normally, the performance after the first epoch is already much higher than that of the baseline.
1- Here the initialization of the backbone is skipped:
`if name.startswith("backbone."): continue`
(a sketch of one way to also initialize the student's backbone from the step-2 checkpoint follows these questions)
2- My teacher and student were trained with Adam with lr=0.001. Do you think I should change the distiller configuration to the same parameters?
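On point 1, a hypothetical sketch of one way to fill in the skipped backbone initialization from the step-2 baseline checkpoint, using plain PyTorch rather than any mechanism of the FGD repo (the paths are placeholders):

```python
# Hypothetical sketch: extract the backbone.* weights from the baseline
# RetinaNet-R50 checkpoint of step 2 into a small checkpoint that can then
# be loaded into the student non-strictly, leaving the neck/head weights
# inherited from the teacher untouched. Paths are placeholders.
import torch

src = 'work_dirs/retinanet_r50_damage/latest.pth'        # baseline from step 2
dst = 'work_dirs/retinanet_r50_damage/backbone_only.pth'

ckpt = torch.load(src, map_location='cpu')
state = ckpt.get('state_dict', ckpt)

# keep only the backbone weights of the baseline detector
backbone_state = {k: v for k, v in state.items() if k.startswith('backbone.')}
torch.save({'state_dict': backbone_state}, dst)
print(f'saved {len(backbone_state)} backbone tensors to {dst}')

# The saved weights can then be loaded into the student with
# student.load_state_dict(backbone_state, strict=False), so only the
# backbone parameters are overwritten.
```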
I have initialized the backbone of the student and set the optimizer the same as the baseline. Now I start at 71% after the first epoch, but in the next epochs I get significant drops and the model does not seem to converge:
```
2023-09-06 18:25:50,163 - mmdet - INFO - Epoch [1][50/203] lr: 9.890e-05, eta: 3:06:07, time: 2.316, data_time: 1.673, memory: 5745, loss_cls: 0.9247, loss_bbox: 0.4944, loss_fgd_fpn_4: 37.8894, loss_fgd_fpn_3: 7.7019, loss_fgd_fpn_2: 0.6114, loss_fgd_fpn_1: 2.2289, loss_fgd_fpn_0: 8.1574, loss: 58.0080, grad_norm: 571.0370
2023-09-06 18:26:23,640 - mmdet - INFO - Epoch [1][100/203] lr: 1.988e-04, eta: 1:58:45, time: 0.670, data_time: 0.069, memory: 5745, loss_cls: 0.1247, loss_bbox: 0.1389, loss_fgd_fpn_4: 2.6471, loss_fgd_fpn_3: 0.9848, loss_fgd_fpn_2: 0.2494, loss_fgd_fpn_1: 0.8901, loss_fgd_fpn_0: 3.3881, loss: 8.4231, grad_norm: 228.2081
2023-09-06 18:26:54,944 - mmdet - INFO - Epoch [1][150/203] lr: 2.987e-04, eta: 1:34:45, time: 0.626, data_time: 0.020, memory: 5745, loss_cls: 0.1039, loss_bbox: 0.1309, loss_fgd_fpn_4: 4.3547, loss_fgd_fpn_3: 0.9474, loss_fgd_fpn_2: 0.1836, loss_fgd_fpn_1: 0.6756, loss_fgd_fpn_0: 2.6093, loss: 9.0055, grad_norm: 350.5065
2023-09-06 18:27:25,823 - mmdet - INFO - Epoch [1][200/203] lr: 3.986e-04, eta: 1:22:20, time: 0.618, data_time: 0.019, memory: 5745, loss_cls: 0.1332, loss_bbox: 0.1357, loss_fgd_fpn_4: 6.1208, loss_fgd_fpn_3: 1.2110, loss_fgd_fpn_2: 0.1994, loss_fgd_fpn_1: 0.6954, loss_fgd_fpn_0: 2.6123, loss: 11.1077, grad_norm: 423.2370
bbox_mAP: 0.7140

2023-09-06 18:31:15,442 - mmdet - INFO - Epoch [2][50/203] lr: 5.045e-04, eta: 1:38:13, time: 2.226, data_time: 1.617, memory: 5745, loss_cls: 0.1705, loss_bbox: 0.1552, loss_fgd_fpn_4: 2.1541, loss_fgd_fpn_3: 0.8034, loss_fgd_fpn_2: 0.1834, loss_fgd_fpn_1: 0.6773, loss_fgd_fpn_0: 2.6396, loss: 6.7835, grad_norm: 174.3350
2023-09-06 18:31:47,209 - mmdet - INFO - Epoch [2][100/203] lr: 6.044e-04, eta: 1:29:06, time: 0.635, data_time: 0.020, memory: 5745, loss_cls: 0.1480, loss_bbox: 0.1469, loss_fgd_fpn_4: 2.1242, loss_fgd_fpn_3: 0.6617, loss_fgd_fpn_2: 0.1525, loss_fgd_fpn_1: 0.5729, loss_fgd_fpn_0: 2.2660, loss: 6.0722, grad_norm: 183.8174
2023-09-06 18:32:18,974 - mmdet - INFO - Epoch [2][150/203] lr: 7.043e-04, eta: 1:22:25, time: 0.635, data_time: 0.021, memory: 5745, loss_cls: 0.1637, loss_bbox: 0.1636, loss_fgd_fpn_4: 1.7635, loss_fgd_fpn_3: 0.6042, loss_fgd_fpn_2: 0.1576, loss_fgd_fpn_1: 0.5902, loss_fgd_fpn_0: 2.3537, loss: 5.7965, grad_norm: 154.4271
2023-09-06 18:32:50,011 - mmdet - INFO - Epoch [2][200/203] lr: 8.042e-04, eta: 1:17:08, time: 0.621, data_time: 0.029, memory: 5745, loss_cls: 0.1569, loss_bbox: 0.1697, loss_fgd_fpn_4: 3.1833, loss_fgd_fpn_3: 0.7770, loss_fgd_fpn_2: 0.1706, loss_fgd_fpn_1: 0.6309, loss_fgd_fpn_0: 2.3866, loss: 7.4749, grad_norm: 227.1264
bbox_mAP: 0.6870

2023-09-06 18:36:43,303 - mmdet - INFO - Epoch [3][50/203] lr: 9.101e-04, eta: 1:25:49, time: 2.288, data_time: 1.640, memory: 5745, loss_cls: 0.4277, loss_bbox: 0.2139, loss_fgd_fpn_4: 4.8695, loss_fgd_fpn_3: 1.1391, loss_fgd_fpn_2: 0.2268, loss_fgd_fpn_1: 0.8387, loss_fgd_fpn_0: 3.3569, loss: 11.0726, grad_norm: 293.6464
2023-09-06 18:37:14,756 - mmdet - INFO - Epoch [3][100/203] lr: 1.000e-03, eta: 1:20:59, time: 0.629, data_time: 0.019, memory: 5745, loss_cls: 0.4082, loss_bbox: 0.2028, loss_fgd_fpn_4: 2.1853, loss_fgd_fpn_3: 0.7505, loss_fgd_fpn_2: 0.2096, loss_fgd_fpn_1: 0.7672, loss_fgd_fpn_0: 3.3458, loss: 7.8694, grad_norm: 174.7761
2023-09-06 18:37:46,440 - mmdet - INFO - Epoch [3][150/203] lr: 1.000e-03, eta: 1:16:57, time: 0.634, data_time: 0.022, memory: 5745, loss_cls: 0.2216, loss_bbox: 0.1832, loss_fgd_fpn_4: 2.8286, loss_fgd_fpn_3: 0.7335, loss_fgd_fpn_2: 0.1637, loss_fgd_fpn_1: 0.5890, loss_fgd_fpn_0: 2.2849, loss: 7.0046, grad_norm: 201.6889
2023-09-06 18:38:18,894 - mmdet - INFO - Epoch [3][200/203] lr: 1.000e-03, eta: 1:13:36, time: 0.649, data_time: 0.017, memory: 5745, loss_cls: 0.4074, loss_bbox: 0.2027, loss_fgd_fpn_4: 2.2815, loss_fgd_fpn_3: 0.8024, loss_fgd_fpn_2: 0.1913, loss_fgd_fpn_1: 0.7121, loss_fgd_fpn_0: 2.5630, loss: 7.1605, grad_norm: 149.1443
bbox_mAP: 0.4750
```
That seems strange. Does the baseline behave the same way, with the first epoch performing best? Is the learning rate for distillation kept the same as the baseline's?
Thanks. The problem was the learning rate. I reduced it and now the optimization works. I will share the results under this post for reference.
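For reference, a minimal sketch of the kind of override this amounts to in an MMDetection 2.x config; the optimizer type matches the Adam, lr=0.001 setup mentioned above, but the reduced value and schedule parameters here are illustrative, not the exact ones used:

```python
# Sketch only: keep the baseline optimizer but reduce the learning rate
# for distillation. The value 1e-4 is illustrative.
optimizer = dict(type='Adam', lr=1e-4)
# Optional idea (not from this thread): clip gradients, since grad_norm
# spikes above 500 appear in the first-epoch log above.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# Standard step schedule with linear warm-up, consistent with the warm-up
# visible in the logged lr values.
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[16, 22])
```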
My other question is: what are **loss_fgd_fpn_0** to **loss_fgd_fpn_4**? I mean, is loss_fgd_fpn_0 the L_at loss and loss_fgd_fpn_4 the L_focal loss?
I am using fgd_retina_r101_fpn_2x_distill_retina_r50_fpn_2x_coco.py to distill RetinaNet-R101 (trained on my own COCO-formatted data, mAP 0.75) into RetinaNet-R50, but the student's mAP drops significantly to 0.58. I am using SGD with the same parameters as in the original config (sketched after the logs for reference). Training plateaus after 20 epochs with loss_cls around 0.25:
"epoch": 20, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.03356, "loss_cls": 0.25744, "loss_bbox": 0.19824, "loss_fgd_fpn_4": 2.68374, "loss_fgd_fpn_3": 2.18955, "loss_fgd_fpn_2": 0.43481, "loss_fgd_fpn_1": 1.17425, "loss_fgd_fpn_0": 4.25011, "loss": 11.18815, "grad_norm": 53.40476, "time": 0.65314 "bbox_mAP": 0.561
"epoch": 21, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.02153, "loss_cls": 0.26974, "loss_bbox": 0.21168, "loss_fgd_fpn_4": 2.03119, "loss_fgd_fpn_3": 1.96431, "loss_fgd_fpn_2": 0.43961, "loss_fgd_fpn_1": 1.17302, "loss_fgd_fpn_0": 4.24643, "loss": 10.33597, "grad_norm": 43.16615, "time": 0.60769} "bbox_mAP": 0.571,
"epoch": 22, "iter": 200, "lr": 0.001, "memory": 5598, "data_time": 0.02091, "loss_cls": 0.25263, "loss_bbox": 0.19929, "loss_fgd_fpn_4": 1.99079, "loss_fgd_fpn_3": 1.83311, "loss_fgd_fpn_2": 0.42802, "loss_fgd_fpn_1": 1.14707, "loss_fgd_fpn_0": 4.15834, "loss": 10.00925, "grad_norm": 46.78779, "time": 0.62421 bbox_mAP": 0.578,
"epoch": 23, "iter": 200, "lr": 0.0001, "memory": 5598, "data_time": 0.01811, "loss_cls": 0.25804, "loss_bbox": 0.20361, "loss_fgd_fpn_4": 1.42493, "loss_fgd_fpn_3": 1.68207, "loss_fgd_fpn_2": 0.41495, "loss_fgd_fpn_1": 1.12626, "loss_fgd_fpn_0": 4.09925, "loss": 9.20911, "grad_norm": 20.09985, "time": 0.60161 "bbox_mAP": 0.577,
"epoch": 24, "iter": 200, "lr": 0.0001, "memory": 5598, "data_time": 0.01983, "loss_cls": 0.25139, "loss_bbox": 0.1968, "loss_fgd_fpn_4": 1.32875, "loss_fgd_fpn_3": 1.57256, "loss_fgd_fpn_2": 0.41188, "loss_fgd_fpn_1": 1.11039, "loss_fgd_fpn_0": 4.06505, "loss": 8.93682, "grad_norm": 19.73, "time": 0.62296} "bbox_mAP": 0.581,
Is this happening because I am distilling using only my fine-tuning data? If not, what could the problem be?