iumyx2612 closed this issue 2 years ago
Hi @iumyx2612, thank you for reaching out.
Simply put, "data-specificity" means "data-dependency". In the self-attention equation (Eq. 1), the Softmax(QK) term corresponds to the kernel of convolution layers (and the Value corresponds to the representation). Here, both Q and K are data dependent (e.g., Q = X W_Q). Therefore, "the self-attention kernel", Softmax(QK), is data-dependent (data specific), whereas the Conv kernel is data-independent (data agnostic).
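To make that concrete, here is a minimal PyTorch sketch (not from the repo; names and shapes are made up for illustration, and the transpose/scaling follow the usual scaled dot-product form rather than the Softmax(QK) shorthand above): the Conv kernel is a fixed learned parameter applied identically to every input, while the self-attention "kernel" is recomputed from each input X.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d = 5, 16                                   # number of tokens, embedding dim
X1, X2 = torch.randn(N, d), torch.randn(N, d)  # two different inputs

# Conv: the kernel is a learned parameter, identical for every input (data agnostic).
conv = torch.nn.Conv1d(d, d, kernel_size=3, padding=1)
# conv.weight stays the same tensor no matter what we feed in.

# Self-attention: the "kernel" Softmax(QK^T) is rebuilt from the input (data specific).
W_Q, W_K = torch.randn(d, d), torch.randn(d, d)  # fixed after training, like conv.weight

def attention_kernel(X):
    Q, K = X @ W_Q, X @ W_K
    return F.softmax(Q @ K.T / d ** 0.5, dim=-1)  # N x N matrix that depends on X

A1, A2 = attention_kernel(X1), attention_kernel(X2)
print(torch.allclose(A1, A2))  # False: different inputs produce different attention kernels
```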
OMG the explanation is simple and really easy to understand. Nice!
@xxxnell Sorry but I have a kinda stupid question again.
Let x be the input feature map. The output feature map y is computed as y = f(x) * x, where * is multiplication and f is a k x k Conv layer (with k equal to the height or width of the feature maps). Can this be treated as an attention mechanism, or in other words, as data dependency?
Hi @iumyx2612. If the kernel of a Conv layer f(x) depends on the feature map x, then we can say that the Conv layer is data specific (data dependent). However, the Conv layer may not behave like self-attention because the kernel is not positive definite.
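A rough PyTorch sketch of one reading of the question (assuming * is elementwise multiplication and f reduces the whole k x k map to a per-channel value; the shapes are hypothetical):

```python
import torch

torch.manual_seed(0)
B, C, H, W = 1, 8, 7, 7
x = torch.randn(B, C, H, W)

# f: a conv whose kernel size equals the spatial size of the feature map,
# so it produces one value per channel (B x C x 1 x 1).
f = torch.nn.Conv2d(C, C, kernel_size=H)

y = f(x) * x  # elementwise multiplication, broadcast over the H x W grid

# The "kernel" of the combined operator is f(x): it changes whenever x changes,
# so y is data dependent even though f's own weights are fixed.
# Unlike softmax attention weights, f(x) can be negative and does not sum to 1.
print(y.shape)  # torch.Size([1, 8, 7, 7])
```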
Here I consider the whole f(x) * x as a big conv, and this big conv has its kernel being f(x). Here, the kernel of this big conv is data dependent (because it depends on the input x, and its kernel is generated through a conv layer) --> So the whole f(x) * x is a data dependent operator, right?
In self-attention, its kernel is generated through a linear layer, which has an unchanged weight matrix during inference, just like a conv weight.
Yes, that's correct. f(x) * x is a data dependent term.
As you correctly pointed out, Q = X W_Q and K = X W_K are linear operators and W_Q/W_K are data independent, but the entire Softmax(QK) term is data dependent.
In addition, I would like to point out that a lot of properties of self-attention come from the softmax function as well as its data dependency. The self-attention formulation can be viewed as an average of feature map values weighted by positive, normalized importance weights (the softmax term).
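A tiny sketch of that last point (using the standard scaled dot-product form, which is an assumption on my part): each softmax row is positive and sums to 1, so every output token is a weighted average of the Value rows, with weights decided by the data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d = 4, 8
X = torch.randn(N, d)
W_Q, W_K, W_V = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)  # importance weights, N x N

# Each row of A is positive and normalized, so A @ V is a convex combination
# (a weighted average) of the Value rows.
print((A > 0).all().item())                          # True
print(torch.allclose(A.sum(dim=-1), torch.ones(N)))  # True
out = A @ V
```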
Ooh, thank you for sharing your knowledge! Great to have a conversation with you.
In the paper, the authors state that "A key feature of MSAs is data specificity (not long-range dependency)".
Can you explain the "data specificity" part? What is it, and how does it behave?
Furthermore, can you elaborate on how MSAs achieve data specificity (through visualizations, formulas, etc.)?