ziplab / LITv2

[NeurIPS 2022 Spotlight] This is the official PyTorch implementation of "Fast Vision Transformers with HiLo Attention"
Apache License 2.0

Question about the code difference between the classification and segmentation #8

Closed Leiyi-Hu closed 1 year ago

Leiyi-Hu commented 1 year ago

Hi, thanks for your excellent work; it is a really elegant way to accelerate ViTs. I have a question about the backbone code for classification and segmentation: the two versions appear to differ slightly. Do they actually use different designs, or are they just different implementations of the same model? Looking forward to your reply!

HubHop commented 1 year ago

Hi Leiyi,

Thanks for your interest! In short, they are the same model (same design), with minor adaptations for different tasks.

For example, in image classification the input resolution is usually fixed and we only care about the output logits. However, for dense prediction tasks such as segmentation, the code has been updated to handle different image resolutions. In addition, we add some normalization layers to the backbone (see here) to further process the pyramid feature maps. Hope this answers your question.
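A minimal sketch of what such per-stage normalization can look like in a dense-prediction backbone (the class, dims, and shapes here are illustrative, not LITv2's actual code):

```python
import torch
import torch.nn as nn

# Illustrative sketch: one LayerNorm per pyramid stage, applied to the
# flattened (N, H*W, C) tokens of each stage's output feature map.
class PyramidBackbone(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d) for d in dims)

    def forward(self, feats):
        # feats: list of (N, C, H, W) stage outputs
        outs = []
        for x, norm in zip(feats, self.norms):
            n, c, h, w = x.shape
            x = norm(x.flatten(2).transpose(1, 2))          # (N, H*W, C)
            outs.append(x.transpose(1, 2).reshape(n, c, h, w))
        return outs

embed_dims = [64, 128, 256, 512]  # example per-stage channel widths
feats = [torch.randn(1, d, 64 // 2**i, 64 // 2**i) for i, d in enumerate(embed_dims)]
outs = PyramidBackbone(embed_dims)(feats)  # shapes are preserved per stage
```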

Best, Zizheng

Leiyi-Hu commented 1 year ago

Hi Zizheng, thanks for your reply. You said the code was updated to handle different image resolutions, but it looks like the input is fixed at 512 for ADE20K, as in the following (from segmentation/mmseg/models/backbones/litv2.py):

```python
# absolute position embedding
input_resolution = [
    512 // patch_size,
    512 // patch_size,
]
```
Also, I want to confirm whether only the normalization layers were changed, or whether there are other changes as well.

Best regards,
Leiyi
HubHop commented 1 year ago

Hi Leiyi, note that we do not use absolute position embedding in LITv2, so please just ignore this line of code. Besides, for semantic segmentation on ADE20K, it is common practice to set the training image resolution to 512x512. However, for object detection on COCO, the input resolution is not fixed due to data augmentation. There are no other changes in the architecture, since we need to load the pretrained weights from ImageNet training; otherwise there would be a mismatch between architectures.
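As a side note, one common way window-based attention copes with arbitrary input resolutions is to pad the feature map so its height and width become multiples of the window size. A hedged sketch under that assumption (`pad_to_window` is an illustrative helper, not from the LITv2 codebase):

```python
import torch
import torch.nn.functional as F

# Illustrative: pad right/bottom so H and W are divisible by window size s.
def pad_to_window(x, s):
    # x: (N, C, H, W)
    _, _, h, w = x.shape
    pad_h = (s - h % s) % s
    pad_w = (s - w % s) % s
    return F.pad(x, (0, pad_w, 0, pad_h))

x = torch.randn(1, 64, 50, 77)
y = pad_to_window(x, s=2)
print(y.shape)  # torch.Size([1, 64, 50, 78])
```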

Best, Zizheng

Leiyi-Hu commented 1 year ago

Hi Zizheng, thanks a lot! I'll probably change the resolution to match my dataset. One more question: you mentioned that a smaller alpha may work better for semantic segmentation. If I change the value of alpha, and also adjust the local window size s, will these modifications affect loading the pretrained weights? Or, if I want to change these parameters, do I have to pretrain the model from scratch? (According to the code, once alpha is fixed, the linear projections for Lo-Fi and Hi-Fi are determined.)

Best,
Leiyi

HubHop commented 1 year ago

Good question!

  1. Changing the local window size will not affect the model parameters, since average pooling and window partition do not require learnable parameters; you can still load the previous ImageNet pretrained weights.
  2. Changing the value of alpha allocates a different number of heads to the Hi-Fi and Lo-Fi branches in the proposed HiLo attention, which results in a different computational graph. In this case, you will need to retrain the model on ImageNet from scratch with the new alpha.
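To illustrate point 2, here is a rough sketch of how an alpha-based head split changes the shapes of the branch projections (the names, the split direction, and the shapes are illustrative, not the exact LITv2 implementation):

```python
# Illustrative: alpha splits the attention heads between the Lo-Fi and
# Hi-Fi branches, so each branch's projection weights are sized by its
# head share. A different alpha gives differently shaped weight tensors,
# which is why old checkpoints no longer load.
def hilo_param_shapes(dim=384, num_heads=8, alpha=0.5):
    head_dim = dim // num_heads
    l_heads = int(num_heads * alpha)   # heads assumed assigned to Lo-Fi
    h_heads = num_heads - l_heads      # remaining heads go to Hi-Fi
    l_dim, h_dim = l_heads * head_dim, h_heads * head_dim
    return {"lofi_qkv": (l_dim * 3, dim), "hifi_qkv": (h_dim * 3, dim)}

print(hilo_param_shapes(alpha=0.5))
print(hilo_param_shapes(alpha=0.25))  # different shapes: old weights won't load
```

The window size s never appears in these shapes, which is why point 1 holds: changing s leaves every learnable tensor untouched.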

To help further research, we have just uploaded LITv2-S checkpoints pretrained on ImageNet with different choices of alpha. You can find them here.

Cheers, Zizheng

Leiyi-Hu commented 1 year ago

Thank you very much! Your reply is very detailed and helpful! Best, Leiyi