Closed: zplovekq closed this issue 4 years ago
In our implementation, the residual connection in the Transformer always takes the first layer's representation. Another option is to have the residual connection take the previous layer's representation.
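A minimal sketch (plain Python, not the repo's actual code) contrasting the two residual choices described above: every layer adding the layer-0 input versus every layer adding its own input. The function names here are hypothetical.

```python
def elementwise_add(a, b):
    """Add two equal-length lists elementwise (stand-in for tensor addition)."""
    return [x + y for x, y in zip(a, b)]

def run_stack(z0, layers, residual="first"):
    """Apply `layers` (each a fn: list -> list) with a residual connection.

    residual="first":    every layer adds the layer-0 input z0
                         (the choice described in this repo).
    residual="previous": every layer adds its own input
                         (the standard Transformer choice).
    """
    z = z0
    for layer in layers:
        skip = z0 if residual == "first" else z
        z = elementwise_add(layer(z), skip)
    return z
```

With two doubling layers and input `[1.0]`, the "first" variant yields `[7.0]` while the "previous" variant yields `[9.0]`, which makes the difference between the two skip paths concrete.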
Thanks for your reply! I mean, in this formula, how could $LN(Z_V^{[0]})$ become $LN(Z_V^{[i-1]})$? Or do I misunderstand something?
@zplovekq Actually, there is no $Z_\beta^{[i]}$. This is probably more straightforward to see in Figure 2: for instance, for the language (L) modality, there is a V->L transformer and an A->L transformer. So there is no $Z_L^{[i]}$ at this stage, but only $Z_{A \to L}^{[i]}$ or $Z_{V \to L}^{[i]}$... That's the whole point of doing cross-modal attention.
While you can certainly take the intermediate layers of the A->L transformer (i.e., $Z_{A \to L}^{[i]}$) and the intermediate layers of the V->L transformer (i.e., $Z_{V \to L}^{[i]}$) by intercepting the for loop (https://github.com/yaohungt/Multimodal-Transformer/blob/master/modules/transformer.py#L81), there is no guarantee that these two transformers have the same depth. Thus, I would be cautious about defining/using $Z_V^{[i]}$.
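A hedged sketch of what "intercepting the for loop" could look like: collect each layer's output as the forward pass runs. `forward_collect` and its arguments are hypothetical names, not the repo's API; `layers` stands in for the transformer's layer stack.

```python
def forward_collect(z0, layers):
    """Run `layers` sequentially, returning every intermediate representation.

    Returns a list of length len(layers) + 1: the input z0 followed by
    each layer's output.
    """
    intermediates = [z0]
    z = z0
    for layer in layers:
        z = layer(z)
        intermediates.append(z)
    return intermediates
```

Note the caveat above: if the A->L stack has a different number of layers than the V->L stack, the two collected lists have different lengths, so there is no natural layer-by-layer pairing to define a single $Z_V^{[i]}$.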
Hope this helps.
Sincere thanks, that really helps! In Figure 3(b), at layer 0, the inputs are $Z_\alpha^{[0]}$ and $Z_\beta^{[0]}$, and the CM block's input is always $Z_\beta^{[0]}$. In Section 3.1, the paper says "Each crossmodal attention block adapts directly from the low-level feature sequence (i.e., $Z_\beta^{[0]}$ in Figure 3(b))" and "We leave the empirical study for adapting from intermediate-level features (i.e., $Z_\beta^{[i-1]}$)". What does $Z_\beta^{[i-1]}$ mean, and how is it obtained? Do you mean $Z_{A \to L}^{[i]}$ or $Z_{V \to L}^{[i]}$ become the CM attention's keys and values? Thanks for your patience!
You will have to use a very different implementation scheme from ours to do that adaptation. Specifically, you will need to run the forward passes of all the crossmodal transformers in parallel. In that case, you can simply fuse the hidden units of these transformers to get $Z_\beta^{[i]}$. For example, you can define $Z_{A \to L}^{[i]} + Z_{V \to L}^{[i]} = Z_L^{[i]}$. There may be other configurations, too.
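A minimal sketch of that parallel scheme, under the assumptions above: all crossmodal streams advance one layer per step, and after each step the streams targeting a modality are summed to redefine the fused representation that the next layer attends to as keys/values. Everything here (names, scalar "features", the layer signature `f(query, key_value)`) is hypothetical, not the repo's actual code.

```python
def parallel_forward(z0, cm_layers, depth):
    """z0: dict modality -> low-level feature (a float here, for brevity).
    cm_layers: dict (src, tgt) -> list of `depth` layer fns f(query, key_value).

    Each stream src->tgt keeps its own state (its queries); after every
    layer i, the streams targeting a modality are summed to redefine the
    fused Z^{[i]} used as keys/values by layer i+1, e.g.
    Z_{A->L}^{[i]} + Z_{V->L}^{[i]} = Z_L^{[i]}.
    """
    stream = {(s, t): z0[t] for (s, t) in cm_layers}  # per-stream queries
    fused = dict(z0)                                  # Z_beta^{[0]}
    for i in range(depth):
        # Advance every crossmodal stream by one layer, attending to the
        # fused representation of its source modality.
        for (s, t), layers in cm_layers.items():
            stream[(s, t)] = layers[i](stream[(s, t)], fused[s])
        # Fuse by summation to obtain Z_beta^{[i]} for each modality.
        fused = {m: sum(v for (s, t), v in stream.items() if t == m)
                 for m in z0}
    return stream, fused
```

The design point is that fusion happens inside the loop, so layer $i$ can consume $Z_\beta^{[i-1]}$; with the sequential per-transformer forward passes in the current repo, the fused representation of the other modalities is simply not available at that point.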
In Section 3.1 of the paper, the attention over the source modality uses $Z_\beta^{[0]}$, while the intermediate-level variant uses $Z_\beta^{[i-1]}$. How do you get $Z_\beta^{[i-1]}$, and what does it mean? Looking forward to your reply. Thanks!