Open dinhanhx opened 2 years ago
Hi,
I was inspired by "Convolutional Self-Attention Networks" [2], and implemented the two-dimensional ConViT model for vision tasks from scratch. Yang et al. [2] mainly proposed one-dimensional convolutional transformers for natural language tasks. As far as I know, no official implementation of [2] is provided.
I will add the reference [2] to convit.py. Thank you for your feedback!
[2] Baosong Yang, Longyue Wang, Derek F Wong, Lidia S Chao, and Zhaopeng Tu. "Convolutional self-attention networks". NAACL, 2019.
@xxxnell Uhm, so what are the differences between these two attention mechanisms? https://github.com/xxxnell/how-do-vits-work/blob/8752f4e330a38877c628dfa40d57fa9404bb3131/models/attentions.py#L68-L101
and
Attention2d in models/attentions.py is traditional global self-attention. ConvAttention2d in models/convit.py is convolutional self-attention, which is a kind of local self-attention. ConvAttention2d calculates self-attention only between tokens in convolutional receptive fields (e.g., 3x3) after unfolding the tokens like Conv2d.
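To make the global vs. local distinction concrete, here is a minimal dependency-free sketch in plain Python. It is not the code from models/attentions.py or models/convit.py: all function names are hypothetical, queries, keys, and values are the raw tokens (no learned projections, no multiple heads), and windows are clipped at the image border instead of padded, which is an assumption.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def neighbors(i, j, height, width, window):
    """Positions inside the window x window receptive field centred at (i, j).

    Assumption: the window is clipped at the border rather than zero-padded.
    """
    r = window // 2
    return [(a, b)
            for a in range(max(0, i - r), min(height, i + r + 1))
            for b in range(max(0, j - r), min(width, j + r + 1))]

def self_attention(grid, window=None):
    """window=None gives global attention; window=3 attends within 3x3 fields."""
    height, width = len(grid), len(grid[0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            if window is None:
                positions = [(a, b) for a in range(height) for b in range(width)]
            else:
                positions = neighbors(i, j, height, width, window)
            context = [grid[a][b] for a, b in positions]
            # Raw tokens serve as queries, keys, and values in this sketch.
            row.append(attend(grid[i][j], context, context))
        out.append(row)
    return out

# A 4x4 grid of 2-dimensional tokens.
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
global_out = self_attention(grid)            # every token sees all 16 tokens
local_out = self_attention(grid, window=3)   # every token sees at most 9 tokens
print(len(neighbors(0, 0, 4, 4, 3)), len(neighbors(1, 1, 4, 4, 3)))  # 4 9
```

With a window large enough to cover the whole grid, the local variant reduces to the global one, which is one way to see that it is a restriction of global self-attention.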
I think I understand now. Just one more question: if I use Attention2d in models/attentions.py, I should cite your paper, right?
Yes. I'd really appreciate it if you would cite my paper.
@xxxnell Quick question: which part of your publication mentions Attention2d in models/attentions.py? From what I read, you only mentioned MSA from the vanilla transformer.
@dinhanhx Oh! Sorry for the confusion. Attention2d in models/attentions.py is almost identical to traditional MSA in vanilla ViT, so I think you should cite the original ViT paper. Please cite my paper only if you have used or modified my code and implementation directly.
@xxxnell Well, I found that your Attention2d in models/attentions.py is kinda similar to this one https://github.com/lucidrains/vit-pytorch/blob/c2aab05ebfb01b9740ba4dae5b515fce1140e97d/vit_pytorch/cvt.py#L70-L102 from CvT: Introducing Convolutions to Vision Transformers. From my understanding, the major difference is the number of CNN layers used to project qkv.
@dinhanhx Ah, I think I now understand what you pointed out! I initially used two Convs for qkv to improve the performance of AlterNet. So there was an experiment and a discussion on stride k in self.to_kv in the first draft, but they were removed in the final revision for better readability. As a result, in the context of my paper, I didn't take advantage of the two Convs and the stride attribute for the sake of simplicity, and using one Conv instead of two Convs also looks good to me. In addition, since a lot of my implementations are based on https://github.com/lucidrains/vit-pytorch, I think it's also great to cite the original project when using the code.
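For readers following the stride discussion, here is a rough sketch of why a strided key/value projection matters: it shrinks the number of key/value tokens, and with it the attention score matrix, while the queries stay at full resolution. The helper names are hypothetical and not from the repository, and using a kernel size equal to the stride with no padding is an illustrative assumption that the thread does not state.

```python
def conv_output_size(size, kernel=1, stride=1, padding=0):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def attention_shape(height, width, k=1):
    """Return (number of queries, number of keys).

    Assumption: queries come from an unstrided 1x1 projection, while keys and
    values come from a projection with kernel size k and stride k.
    """
    num_queries = height * width
    kv_height = conv_output_size(height, kernel=k, stride=k)
    kv_width = conv_output_size(width, kernel=k, stride=k)
    return num_queries, kv_height * kv_width

for k in (1, 2, 4):
    print(k, attention_shape(16, 16, k=k))
# 1 (256, 256)
# 2 (256, 64)
# 4 (256, 16)
```

With k = 1 this is ordinary self-attention over all tokens; larger k trades key/value resolution for a smaller score matrix.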
@xxxnell It was confusing to me since there are a few convolutional attention mechanisms similar to yours. I had a hard time trying to differentiate them.
If I use AlterNet (theory), I cite your paper.
If I use AlterNet (code), I cite your paper and the original project https://github.com/lucidrains/vit-pytorch.
If I only use Attention2d in models/attentions.py, I cite your paper, the CvT paper, and the original project.
Right?
@dinhanhx Right. I think what you said is one of the best practices.
Thanks for supporting me!
Thanks for your comments on our "Convolutional SANs" (https://arxiv.org/abs/1904.03107). We are also very happy to see this can inspire your work. The paper on analyzing Vision Transformers is really insightful and interesting.
We have implemented various SANs including "Convolutional SANs" at: https://github.com/baosongyang/Context-Aware-SAN/blob/main/layers/attention_conv.py.
Hi @longyuewangdcu,
Thank you for the great paper and your kind words. And sorry I missed that implementation. I starred the repository, and I'll take a closer look!
https://github.com/xxxnell/how-do-vits-work/blob/8752f4e330a38877c628dfa40d57fa9404bb3131/models/convit.py#L1-L6
You said it's not the same as the ConViT by d'Ascoli, Stéphane, et al. Then where does this ConViT come from? I ask because if I reuse this code, I want to know whom I should cite.