xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

In convit.py file, where does ConVit come from, really? #9

Open dinhanhx opened 2 years ago

dinhanhx commented 2 years ago

https://github.com/xxxnell/how-do-vits-work/blob/8752f4e330a38877c628dfa40d57fa9404bb3131/models/convit.py#L1-L6

You said it's not the same as ConViT by d'Ascoli, Stéphane, et al. Then where does this ConViT come from? I ask because if I reuse this code, I want to know whom I should cite.

xxxnell commented 2 years ago

Hi,

I was inspired by "Convolutional Self-Attention Networks" [2], and implemented the two-dimensional ConViT model for vision tasks from scratch. Yang et al. [2] mainly proposed one-dimensional convolutional transformers for natural language tasks. As far as I know, no official implementation of [2] is provided.

I will add the reference [2] to convit.py. Thank you for your feedback!

[2] Baosong Yang, Longyue Wang, Derek F Wong, Lidia S Chao, and Zhaopeng Tu. "Convolutional self-attention networks". NAACL, 2019.

dinhanhx commented 2 years ago

@xxxnell Uhm, so what are the differences between these two attention mechanisms? https://github.com/xxxnell/how-do-vits-work/blob/8752f4e330a38877c628dfa40d57fa9404bb3131/models/attentions.py#L68-L101

and

https://github.com/xxxnell/how-do-vits-work/blob/8752f4e330a38877c628dfa40d57fa9404bb3131/models/convit.py#L19-L66

xxxnell commented 2 years ago

Attention2d in models/attentions.py is traditional global self-attention. ConvAttention2d in models/convit.py is convolutional self-attention, and it is a kind of local self-attention. ConvAttention2d calculates self-attention only between tokens in convolutional receptive fields (e.g., 3x3) after unfolding the tokens like Conv2d.
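To make the global/local distinction above concrete, here is a minimal pure-Python sketch. It is not the repository's `Attention2d` or `ConvAttention2d`: it assumes identity projections (q = k = v = the token itself), a single head, and windows that are clipped at the image border rather than padded. The function names `attend`, `global_attention`, `window`, and `local_attention` are hypothetical and chosen only for illustration.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a set of key/value tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def global_attention(grid):
    """Global self-attention: every token attends to all H x W tokens."""
    tokens = [tok for row in grid for tok in row]
    return [[attend(tok, tokens, tokens) for tok in row] for row in grid]


def window(grid, i, j, k=3):
    """Tokens inside the k x k window centred on (i, j), clipped at the border."""
    h, w, r = len(grid), len(grid[0]), k // 2
    return [grid[y][x]
            for y in range(max(0, i - r), min(h, i + r + 1))
            for x in range(max(0, j - r), min(w, j + r + 1))]


def local_attention(grid, k=3):
    """Local (convolution-style) self-attention: each token attends only to its window."""
    out = []
    for i in range(len(grid)):
        row = []
        for j in range(len(grid[0])):
            neighbours = window(grid, i, j, k)
            row.append(attend(grid[i][j], neighbours, neighbours))
        out.append(row)
    return out
```

With a 3x3 window, an interior token sees 9 neighbours and a corner token sees 4 under the clipping assumption. Once the window is large enough to cover the whole grid, the local variant reduces to the global one.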

dinhanhx commented 2 years ago

I think I understand now. Just one more question: if I use Attention2d in models/attentions.py, I should cite your paper, right?

xxxnell commented 2 years ago

Yes. I'd really appreciate it if you would cite my paper.

dinhanhx commented 2 years ago

@xxxnell Quick question: which part of your publication mentions Attention2d in models/attentions.py? From what I read, you only mention MSA from the vanilla transformer.

xxxnell commented 2 years ago

@dinhanhx Oh! Sorry for the confusion. Attention2d in models/attentions.py is almost identical to traditional MSA in vanilla ViT, so I think you should cite the original ViT paper. Please cite my paper only if you have used or modified my code and implementation directly.

dinhanhx commented 2 years ago

@xxxnell Well, I found that your Attention2d in models/attentions.py is quite similar to this one: https://github.com/lucidrains/vit-pytorch/blob/c2aab05ebfb01b9740ba4dae5b515fce1140e97d/vit_pytorch/cvt.py#L70-L102 from "CvT: Introducing Convolutions to Vision Transformers". From my understanding, the major difference is the number of convolution layers used to project qkv.

xxxnell commented 2 years ago

@dinhanhx Ah, I think I now understand what you pointed out! I initially used two Convs for qkv to improve the performance of AlterNet. The first draft therefore included an experiment and a discussion on the stride k in self.to_kv, but these were removed in the final revision for better readability. As a result, the paper does not take advantage of the two Convs or the stride attribute, for the sake of simplicity, and using one Conv instead of two also looks good to me. In addition, since many of my implementations are based on https://github.com/lucidrains/vit-pytorch, I think it is also good to cite that original project if you use the code.
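As a rough illustration of why a stride on the key/value projection matters, the sketch below counts query and key tokens using the standard convolution output-size formula. It assumes, without confirmation from this thread, that the query projection keeps full resolution and that the key/value projection uses kernel size k, stride k, and no padding. The helper names `conv_out_size` and `attention_shape` are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_shape(height, width, k):
    """(number of queries, number of keys) for an H x W feature map.

    Assumption: queries stay at full resolution, while keys/values are
    projected with kernel size k, stride k, and no padding.
    """
    n_queries = height * width
    n_keys = conv_out_size(height, k, k) * conv_out_size(width, k, k)
    return n_queries, n_keys


# A 14 x 14 feature map: stride 1 keeps all 196 keys, stride 2 keeps 49.
print(attention_shape(14, 14, 1))  # (196, 196)
print(attention_shape(14, 14, 2))  # (196, 49)
```

Under these assumptions, a stride of k shrinks the attention map by roughly a factor of k squared along the key axis, while k = 1 gives the same token counts as a single unstrided projection.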

dinhanhx commented 2 years ago

@xxxnell It was confusing to me since there are a few convolutional attention mechanisms similar to yours. I had a hard time trying to differentiate them.

If I use AlterNet (theory), I cite your paper. If I use AlterNet (code), I cite your paper and the original project https://github.com/lucidrains/vit-pytorch. If I only use Attention2d in models/attentions.py, I cite your paper, the CvT paper, and the original project.

Right?

xxxnell commented 2 years ago

@dinhanhx Right. I think what you said is one of the best practices.

dinhanhx commented 2 years ago

Thanks for supporting me!

longyuewangdcu commented 2 years ago

Thanks for your comments on our "Convolutional SANs" (https://arxiv.org/abs/1904.03107). We are also very happy to see this can inspire your work. The paper on analyzing Vision Transformers is really insightful and interesting.

longyuewangdcu commented 2 years ago

We have implemented various SANs including "Convolutional SANs" at: https://github.com/baosongyang/Context-Aware-SAN/blob/main/layers/attention_conv.py.

xxxnell commented 2 years ago

Hi @longyuewangdcu,

Thank you for the great paper and your kind words. And sorry I missed that implementation. I starred the repository, and I'll take a closer look!