xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0

What exactly makes MSAs data specific? #26

Closed iumyx2612 closed 2 years ago

iumyx2612 commented 2 years ago

In the paper, the authors state that "A key feature of MSAs is data specificity (not long-range dependency)".

Can you explain the "data specificity" part? What is it, and how does it behave?

Furthermore, can you elaborate on how MSAs achieve data specificity (through visualizations, formulas, etc.)?

xxxnell commented 2 years ago

Hi @iumyx2612, thank you for reaching out.

Simply put, "data-specificity" means "data-dependency". In the self-attention equation (Eq. 1), the Softmax(QK) term corresponds to the kernel of convolution layers (and the Value corresponds to the representation). Here, both Q and K are data dependent (e.g., Q = X W_Q). Therefore, "the self-attention kernel", Softmax(QK), is data-dependent (data specific) whereas the Conv kernel is data-independent (data agnostic).
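
As a rough illustration, here is a minimal single-head sketch with toy shapes (the names and the 1/sqrt(d) scaling are just for clarity, not code from this repo): the conv kernel is a fixed learned tensor, while the self-attention kernel Softmax(QK^T) is recomputed from the input X every time.

```python
import torch
import torch.nn.functional as F

n, d = 16, 64                      # number of tokens, embedding dimension
X = torch.randn(n, d)              # input tokens

# Convolution: the kernel is a learned constant, independent of X.
conv_kernel = torch.randn(3, 3)    # the same kernel is applied to every input

# Self-attention: the "kernel" Softmax(QK^T) is recomputed from X itself.
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))  # data-independent weights
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
attn_kernel = F.softmax(Q @ K.T / d ** 0.5, dim=-1)    # data-dependent kernel
out = attn_kernel @ V                                   # weighted average of the Values
```

After training, the conv kernel never changes, but the self-attention kernel is different for every input X.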

iumyx2612 commented 2 years ago

> Hi @iumyx2612, thank you for reaching out.
>
> Simply put, "data-specificity" means "data-dependency". In the self-attention equation (Eq. 1), the Softmax(QK) term corresponds to the kernel of convolution layers (and the Value corresponds to the representation). Here, both Q and K are data dependent (e.g., Q = X W_Q). Therefore, "the self-attention kernel", Softmax(QK), is data-dependent (data specific) whereas the Conv kernel is data-independent (data agnostic).

OMG the explanation is simple and really easy to understand. Nice!

iumyx2612 commented 1 year ago

@xxxnell Sorry, but I have a kind of stupid question again.
Let x be the input feature map. The output feature map y is computed as y = f(x) * x, where * is elementwise multiplication and f is a k×k Conv layer (with k equal to the height or width of the feature map). Can this be treated as an attention mechanism, or in other words, as data dependency?

xxxnell commented 1 year ago

Hi @iumyx2612. If the kernel of a Conv layer f(x) depends on feature maps x, then we can say that the Conv layer is data specific (data dependent). However, the Conv layer may not behave like self-attention because the kernel is not positive definite.
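
For concreteness, here is a rough sketch of the gating you describe (toy shapes, and a small 3×3 conv standing in for the full-size kernel you mention): f(x) changes with x, so the operator is data dependent, but the resulting multiplier is neither positive nor normalized the way a softmax kernel is.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)                   # toy feature map (B, C, H, W)
f = nn.Conv2d(8, 8, kernel_size=3, padding=1)   # produces a per-input multiplier

gate = f(x)      # depends on x, so the operator below is data dependent
y = gate * x     # elementwise reweighting of the feature map

# Unlike a softmax attention kernel, `gate` is generally neither positive
# nor normalized: its entries can be negative and do not sum to one.
print((gate < 0).any().item())   # typically True
```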

iumyx2612 commented 1 year ago

> Hi @iumyx2612. If the kernel of a Conv layer f(x) depends on feature maps x, then we can say that the Conv layer is data specific (data dependent). However, the Conv layer may not behave like self-attention because the kernel is not positive definite.

Here I consider the whole f(x) * x as one big conv whose kernel is f(x). The kernel of this big conv is data dependent (because it depends on the input x, and the kernel is generated by a conv layer), so the whole f(x) * x is a data dependent operator, right?
In self-attention, the kernel is generated by linear layers whose weight matrices are fixed during inference, just like conv weights.

xxxnell commented 1 year ago

> Here I consider the whole f(x) * x as one big conv whose kernel is f(x). The kernel of this big conv is data dependent (because it depends on the input x, and the kernel is generated by a conv layer), so the whole f(x) * x is a data dependent operator, right?

Yes, that's correct. f(x) * x is a data dependent term.

> In self-attention, the kernel is generated by linear layers whose weight matrices are fixed during inference, just like conv weights.

As you correctly pointed out, Q = X W_Q and K = X W_K are linear operators and W_Q/W_K are data independent, but the entire Softmax(QK) term is data dependent.

In addition, I would like to point out that a lot of the properties of self-attention come from the softmax function as well as from its data dependency. The self-attention formulation can be viewed as an average of feature map values weighted by positive, normalized importance weights (the softmax term).
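
To make the "positive, normalized importance weights" point concrete, here is a small sanity-check sketch (toy tensors, single head, not code from this repo):

```python
import torch
import torch.nn.functional as F

n, d = 8, 32
Q, K, V = (torch.randn(n, d) for _ in range(3))

A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)        # the self-attention kernel

# Each row of A contains positive weights that sum to 1, so each output token
# is a convex combination (a weighted average) of the Value vectors.
assert (A > 0).all()
assert torch.allclose(A.sum(dim=-1), torch.ones(n))

out = A @ V
```

So self-attention mixes the Value vectors with weights that are positive and sum to one, and those weights change with the input.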

iumyx2612 commented 1 year ago

> > Here I consider the whole f(x) * x as one big conv whose kernel is f(x). The kernel of this big conv is data dependent (because it depends on the input x, and the kernel is generated by a conv layer), so the whole f(x) * x is a data dependent operator, right?
>
> Yes, that's correct. f(x) * x is a data dependent term.
>
> > In self-attention, the kernel is generated by linear layers whose weight matrices are fixed during inference, just like conv weights.
>
> As you correctly pointed out, Q = X W_Q and K = X W_K are linear operators and W_Q/W_K are data independent, but the entire Softmax(QK) term is data dependent.
>
> In addition, I would like to point out that a lot of the properties of self-attention come from the softmax function as well as from its data dependency. The self-attention formulation can be viewed as an average of feature map values weighted by positive, normalized importance weights (the softmax term).

Ooh, thank you for sharing your knowledge! It was great to have this conversation with you.