yshMars / DistilPose

Implementation for: DistilPose: Tokenized Pose Regression with Heatmap Distillation (CVPR2023)

About the raw performance without distill #2

Open · liza-li opened this issue 1 year ago

liza-li commented 1 year ago

Hi, I trained the DistilPose-S model without distillation. To achieve this, I kept loss_keypoint and simply set the four distillation-related losses to None:

    loss_keypoint=dict(type='SmoothL1Loss', use_target_weight=True),
    loss_vis_token_dist=None,
    loss_kpt_token_dist=None,
    loss_score=None,
    loss_reg2hm=None,

But during training, "acc_pose" kept rising while the AP fluctuated below 0.1 the whole time, which confused me a lot. Here is part of the log file:

    { "mode": "val", "epoch": 10, "iter": 1085, "lr": 0.001, "AP": 0.0211 }
    { "mode": "val", "epoch": 20, "iter": 1085, "lr": 0.001, "AP": 0.0287 }
    { "mode": "val", "epoch": 30, "iter": 1085, "lr": 0.001, "AP": 0.03328 }
    { "mode": "val", "epoch": 40, "iter": 1085, "lr": 0.001, "AP": 0.03867 }
    { "mode": "val", "epoch": 50, "iter": 1085, "lr": 0.001, "AP": 0.04421 }

When I train the model with distillation everything is fine, and when I evaluated L_VT the AP also looks normal, so this phenomenon is even stranger. If my operation is wrong, please let me know. I would also like to know how you obtained this result:

| Distillation | Simulated Heatmaps | TDE | AP    | Improv. |
| ------------ | ------------------ | --- | ----- | ------- |
| No           | -                  | -   | 56.0% | -       |

Also, I'm curious why you use SmoothL1 loss rather than RLE loss :)

yshMars commented 1 year ago
  1. Traditional regression-based models only output 2 values: the coordinates x and y.
    However, DistilPose outputs 5 values: the coordinates x and y, the deviations $\sigma_x$ and $\sigma_y$, and the confidence score s. The confidence score s is taken into account during the evaluation of mAP. Traditional regression-based models cannot predict confidence scores for their keypoint predictions, so they default to s=1. RLE provides confidence scores via a flow model, and DistilPose obtains confidence scores under the guidance of teacher heatmaps. Therefore, if you simply set the distillation losses to None, the model still outputs the deviations $\sigma_x$ and $\sigma_y$ and the confidence score s, but they are under no supervision, so those predictions will be very poor.

    Back to your issue about acc_pose and AP: acc_pose calculates accuracy using PCK, which does not take confidence scores into consideration (see the code here -> func keypoint_pck_accuracy), so it 'continued to rise'. Meanwhile, AP does take the poor confidence scores into consideration (see the code here -> func decode and here -> func evaluate), so the mAP is very low. You can modify the code here -> line 249 as follows, and you should get a more reasonable accuracy:

    # all_preds[:, :, 2:3] = output_score.detach().cpu().numpy()    
    all_preds[:, :, 2:3] = maxvals
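
    To make this concrete, below is a minimal, self-contained sketch (illustrative only, not the repo's evaluation code; shapes and names are made up) of how a 5-value DistilPose prediction splits into coordinates, deviations and score, and why a PCK-style accuracy ignores that score while COCO-style AP ranks predictions by it:

    import numpy as np

    # Hypothetical raw head output per sample/keypoint: (x, y, sigma_x, sigma_y, score).
    pred = np.random.rand(8, 17, 5)                      # (batch, keypoints, 5)
    coords, sigmas, scores = pred[..., :2], pred[..., 2:4], pred[..., 4]

    # PCK-style accuracy (what acc_pose reports) reads only the coordinates,
    # so an unsupervised score column cannot affect it.
    def pck(coords, gt, norm, thr=0.05):
        dist = np.linalg.norm(coords - gt, axis=-1) / norm[:, None]
        return (dist < thr).mean()

    gt = np.random.rand(8, 17, 2)
    norm = np.ones(8)
    print("acc_pose-style PCK:", pck(coords, gt, norm))

    # COCO-style AP, by contrast, ranks predictions by the score that gets
    # written into all_preds[:, :, 2:3] above. A traditional regression
    # baseline simply fills that column with 1, whereas an unsupervised
    # score column drags AP down even when the coordinates are accurate.
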
  2. We trained DistilPose without distillation and output only the coordinates x and y, which gives the 56.0% mAP (i.e., we did not predict the deviations or the confidence score).

  3. TBH, we have tried replacing SmoothL1 with RLE, but the performance gain is negligible, and it sometimes even causes a drop in mAP. We assume the reason is that the optimization directions of RLE and DistilPose are different: DistilPose learns from a heatmap-based teacher, while RLE aims to learn the bias of the annotations. But we are not sure.
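
    For reference, the swap amounts to changing the coordinate loss in the head config. A minimal sketch, assuming the mmpose-style RLELoss registered in this framework (the exact arguments the DistilPose head passes to the loss may differ):

    # Loss actually used by DistilPose:
    loss_keypoint=dict(type='SmoothL1Loss', use_target_weight=True),
    # Hypothetical RLE swap (the real experimental setup may have differed);
    # the head must also feed its predicted deviations (sigma_x, sigma_y) into the loss:
    # loss_keypoint=dict(type='RLELoss', use_target_weight=True, size_average=True),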

I sincerely hope that my reply is helpful to you.