zhihou7 / BatchFormer

CVPR2022, BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning, https://arxiv.org/abs/2203.01522

batchformerV2 on detr #9

Closed OBVIOUSDAWN closed 2 years ago

OBVIOUSDAWN commented 2 years ago

Thanks for your excellent work. I tried to port the BatchFormerV2 code, which is based on Deformable-DETR, to DETR with the modifications here. I also modified the corresponding parameters in main.py. However, when I tried to train the new network, I ran into a problem.

The modified network converges faster and reaches higher accuracy early in training, but its final accuracy is the same as unmodified DETR's. These are the logs for the two runs (changed and previous); both networks used the same training parameters and environment. I am wondering whether there are errors in my modified code that kept it from working in the end, because using distillation on the same baseline brings the accuracy to 59.

I am looking forward to your reply. Thank you, and best regards.

zhihou7 commented 2 years ago

Hi, thanks for your interest. First, the hidden dimension of DETR differs from Deformable-DETR's by default, so this line (https://github.com/OBVIOUSDAWN/tmpchange/blob/c315ee2623cf119ab5b8d1cec889f3f27e0d0017/transformer.py#L115) may not be right. I have provided a gist for DETR here.
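
For reference, a minimal sketch of what such a batch-transformer block looks like (an illustration of the idea, not the repository's exact code; the d_model=256 default is an assumption and must match the hidden dimension of the encoder it is inserted into):

```python
import torch.nn as nn

class BatchFormerV2Block(nn.Module):
    """Batch transformer sketch: self-attention runs across the batch
    dimension at each spatial position, letting the samples in a
    mini-batch exchange information."""

    def __init__(self, d_model=256, nhead=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # A standard encoder layer, applied over the batch axis below.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead,
                                                dim_feedforward, dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model), the layout DETR's encoder uses.
        # Swapping the first two axes makes attention mix samples
        # instead of spatial positions.
        return self.layer(x.transpose(0, 1)).transpose(0, 1)
```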

Besides, I have indeed observed a similar trend during training: BatchFormerV2 converges faster, but the improvement shrinks after many epochs (e.g., with 500 epochs, BatchFormerV2 inserted in the first layer empirically improves DETR by only 0.5-1.0%). However, we improve DETR by 1.7% on panoptic segmentation with this pre-trained model.

Lastly, that is an interesting experimental result! You actually use a similar dimension to DETR, but with an additional layer. This is similar to our real baseline (i.e., increasing the layers and parameters during training, but not using the batch transformer). Empirically, it also improves the model a bit when I evaluate it on DN-DETR. If you need it, I can share the DETR code with you (it might be messy).

Feel free if you have further questions.

OBVIOUSDAWN commented 2 years ago

Thank you for sharing your code; I'm really sorry for not replying sooner. Thanks again for your help!

Also, you mentioned that the pre-trained model obtained from the BatchFormerV2-improved DETR yielded a 1.7% improvement on panoptic segmentation. I would like to know whether, in that result, the DETR segmentation model itself was also modified with BatchFormerV2, or whether it was left completely in its original state.

Finally, you mentioned that directly adding layers and parameters during training can improve the model slightly. Is my understanding correct that BatchFormerV2 has a similar effect, except that it builds a slightly different layer from DETR's transformer encoder layer? After modifying the number of layers, do you fine-tune from DETR's pre-trained weights, or do you train from scratch?

I've read a paper before (probably from SenseTime) claiming that you can match DETR-R50 using only three encoder layers and one decoder layer, but unfortunately it has been two years and the code is still not open source. The effect of the different layers you mentioned seems similar to that paper's.

Thanks again for your help. I am looking forward to your reply. Best regards.

OBVIOUSDAWN commented 2 years ago

Also, I noticed that in the code you shared, you used Swin as the backbone in place of the original ResNet-50. I have made many similar attempts before, but I ran into a bad phenomenon: the model was hard to converge, with the loss stuck at 40 and a very low mAP. In QAHOI, a DETR-related downstream task, only the Swin weights are used as pre-training for initialization (the transformer parameters are trained from scratch), and it converges well. I have asked many people how to make Swin converge, and they suggested training the whole model first and then using the obtained weights for other tasks.

I would like to ask: when training with Swin as the backbone, do you follow the preset parameters, i.e. lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4? Did you get better results with Swin? Also, is it necessary to freeze the backbone as is done for ResNet-50? In my understanding, with the backbone fixed, the workload of training the transformer drops dramatically. Thanks again for your reply.
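
For reference, those preset parameters correspond to how DETR's main.py builds its optimizer; a minimal sketch of that setup (assuming `model` is the constructed DETR model):

```python
import torch

# DETR-style optimizer groups: the backbone gets its own, smaller
# learning rate (lr_backbone); everything else uses the main lr.
param_dicts = [
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad],
     "lr": 1e-5},  # lr_backbone
]
optimizer = torch.optim.AdamW(param_dicts, lr=1e-4, weight_decay=1e-4)
```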

zhihou7 commented 2 years ago

Hi,

  1. I follow these instructions to run DETR for segmentation. There is a pre-training stage for the boxes; I add BatchFormerV2 from that stage, training from scratch.

  2. I train it from scratch. It is the same as DETR, except that we add a BatchFormerV2 module in the BatchFormerV2 stream. The plain extra-layer baseline does improve the model a bit, but it is not as good as BatchFormerV2 (see the sketch after this list).

  3. Maybe. However, I remember trying 7 transformer encoder layers for other ideas (without the two-branch setup), and I did not observe further improvement or a similar trend. The two cases might be related, but they are not that similar.
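
To make the two-stream setup above concrete, here is a minimal sketch of my reading of it (the function name and signature are illustrative, not from the repository): during training, the features pass through both a plain stream and a BatchFormerV2 stream that share the same prediction head, with the targets duplicated accordingly; at test time the BatchFormerV2 stream is dropped, so inference cost is unchanged.

```python
import torch

def two_stream_forward(features, bf_block, shared_head, training=True):
    # features: (seq_len, batch, d_model) encoder features.
    if training:
        # Second stream: the same features after the batch transformer.
        bf_features = bf_block(features)
        # Concatenate along the batch axis; the caller must duplicate
        # the targets so both streams share the same head and losses.
        features = torch.cat([features, bf_features], dim=1)
    # At test time only the plain stream runs, so the module can
    # simply be removed without extra inference cost.
    return shared_head(features)
```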

zhihou7 commented 2 years ago

I have observed similar issues with the Swin backbone. The repository I shared with you contains a lot of messy code. Actually, I changed the backbone for QPIC (HOI detection), not for object detection. I remember that I also failed to achieve results comparable to the ResNet-50 backbone, but I didn't dig into it deeply; maybe I only trained the network for a few epochs. I haven't followed HOI for a long time.

However, I did evaluate BatchFormerV2 for HOI detection based on QPIC, and I found it improves QPIC by about 0.5%. The pre-trained model is the same one QPIC uses.

OBVIOUSDAWN commented 2 years ago

Thank you for your help. I read up on the corresponding task: it uses the model obtained from object detection as the base for the segmentation task, so the accuracy advantage of the pre-trained model is carried over and amplified.

The biggest problem with QPIC is probably that it does not provide a pre-trained model, i.e. a model trained on the object detection task to be used as pre-training for HOI training, a process that requires a lot of computing power but should improve accuracy.

Thanks again for your help.

zhihou7 commented 2 years ago

Hi, @OBVIOUSDAWN,

Thanks for your reply. QPIC uses the pre-trained model provided by DETR.