yaohungt / Multimodal-Transformer

[ACL'19] [PyTorch] Multimodal Transformer
MIT License

How to adapt from intermediate-level instead of low-level features #20

Closed zplovekq closed 4 years ago

zplovekq commented 4 years ago

In Section 3.1 of the paper, the source modality uses $Z_\beta^{[0]}$, while the intermediate-level variant uses $Z_\beta^{[i-1]}$. How do you get $Z_\beta^{[i-1]}$, and what does it mean? Looking forward to your reply. Thanks!

yaohungt commented 4 years ago

In our implementation, the residual connection in the Transformer always takes the first layer's representation. Another option is to have the residual connection take the previous layer's representation.
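The two options can be sketched as a single toy cross-modal block (a minimal illustration, not the repo's actual code; the function name, signature, and the `adapt` flag are hypothetical). In the "low" mode the block always attends over the layer-0 source features; in the "intermediate" mode it attends over the previous layer's source-side representation instead:

```python
import torch
import torch.nn as nn

def crossmodal_layer(z_alpha, z_beta_0, z_beta_prev, cm_attn, adapt="low"):
    """One cross-modal attention block (toy sketch, hypothetical signature).

    adapt="low":          key/value come from the layer-0 source features,
                          as in the released implementation.
    adapt="intermediate": key/value come from the previous layer's
                          source-side representation instead.
    """
    src = z_beta_0 if adapt == "low" else z_beta_prev
    out, _ = cm_attn(z_alpha, src, src)  # query from target, key/value from source
    return z_alpha + out                 # residual connection

# Usage with shapes (seq_len, batch, dim), the MultiheadAttention default layout.
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4)
z_alpha = torch.randn(10, 2, 32)     # target-modality features
z_beta_0 = torch.randn(15, 2, 32)    # layer-0 source features
z_beta_prev = torch.randn(15, 2, 32) # previous-layer source features
y = crossmodal_layer(z_alpha, z_beta_0, z_beta_prev, attn, adapt="low")
```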

zplovekq commented 4 years ago

Thanks for your reply! I mean, in this formula, how could $LN(Z_V^{[0]})$ become $LN(Z_V^{[i-1]})$? Or do I misunderstand something?

jerrybai1995 commented 4 years ago

@zplovekq Actually, there is no $Z_\beta^{[i]}$. This is probably more straightforward to see in Figure 2: for instance, for the language (L) modality, there is a V->L transformer and an A->L transformer. So there is no $Z_L^{[i]}$ at this stage, only $Z_{A \to L}$ or $Z_{V \to L}$... That's the whole point of doing cross-modal self-attention.

While you can certainly take the intermediate layers of the A->L transformer (i.e., $Z_{A \to L}^{[i]}$) and the intermediate layers of the V->L transformer (i.e., $Z_{V \to L}^{[i]}$) by intercepting the for loop (https://github.com/yaohungt/Multimodal-Transformer/blob/master/modules/transformer.py#L81), there is no guarantee that these two transformers are of the same depth. Thus, I would be cautious about defining/using $Z_L^{[i]}$.
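Intercepting the loop could look like the following (a stand-alone sketch, not the repo's `TransformerEncoder`; the class, `return_intermediates` flag, and layer choice are assumptions, with plain `nn.MultiheadAttention` standing in for the repo's cross-attention layers):

```python
import torch
import torch.nn as nn

class CrossmodalEncoder(nn.Module):
    """Minimal stand-in for a stack of cross-attention layers where the
    query comes from the target modality and key/value come from the
    source modality. Collects every layer's output, i.e. Z_{src->tgt}^{[i]}."""

    def __init__(self, dim, num_heads, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads) for _ in range(num_layers)]
        )

    def forward(self, x_tgt, x_src, return_intermediates=False):
        intermediates = []
        h = x_tgt
        for layer in self.layers:          # the for loop to "intercept"
            h, _ = layer(h, x_src, x_src)  # query=h, key=value=x_src
            intermediates.append(h)        # Z_{src->tgt}^{[i]} for layer i
        return (h, intermediates) if return_intermediates else h

# Usage with shapes (seq_len, batch, dim); source and target may differ in length.
enc = CrossmodalEncoder(dim=30, num_heads=5, num_layers=4)
z_l = torch.randn(20, 2, 30)  # language features (target)
z_a = torch.randn(50, 2, 30)  # audio features (source)
out, inters = enc(z_l, z_a, return_intermediates=True)
```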

Hope this helps.

zplovekq commented 4 years ago

Sincere thanks, that really helps. In Figure 3(b), at layer 0 the input is $Z_\alpha^{[0]}$ and $Z_\beta^{[0]}$, and the CM block's input is always $Z_\beta^{[0]}$. In Section 3.1, the paper says "Each crossmodal attention block adapts directly from the low-level feature sequence (i.e., $Z_\beta^{[0]}$ in Figure 3(b))" and "We leave the empirical study for adapting from intermediate-level features (i.e., $Z_\beta^{[i-1]}$)". What does $Z_\beta^{[i-1]}$ mean, and how do you get it? Do you mean that $Z_{A \to L}^{[i]}$ or $Z_{V \to L}^{[i]}$ becomes the CM attention's key and value? Thanks for your patience!

jerrybai1995 commented 4 years ago

You will have to use a very different implementation scheme from ours to do that adaptation. Specifically, you will need to run the forward passes of all the crossmodal transformers in parallel. In that case, you can simply fuse the hidden units of these transformers to get $Z_\beta^{[i]}$. For example, you could define $Z_{A \to L}^{[i]} + Z_{V \to L}^{[i]} = Z_L^{[i]}$. There may be other configurations, too.
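The parallel scheme described above can be sketched as follows (a hypothetical design, not the released implementation; class and argument names are made up, and simple summation stands in for whatever fusion you choose). Both stacks run in lockstep and are fused after every layer, so layer $i$ attends from $Z_L^{[i-1]} = Z_{A \to L}^{[i-1]} + Z_{V \to L}^{[i-1]}$ rather than always from $Z_L^{[0]}$:

```python
import torch
import torch.nn as nn

class FusedCrossmodalStack(nn.Module):
    """Sketch of intermediate-level adaptation: run the A->L and V->L
    cross-attention layers in parallel and fuse their outputs by summation
    after each layer, feeding the fused state back in as the next query.
    Note zip() forces both stacks to have the same depth."""

    def __init__(self, dim, num_heads, num_layers):
        super().__init__()
        self.a2l = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads) for _ in range(num_layers)]
        )
        self.v2l = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads) for _ in range(num_layers)]
        )

    def forward(self, z_l, z_a, z_v):
        h = z_l  # Z_L^{[0]}: low-level language features
        for a_layer, v_layer in zip(self.a2l, self.v2l):
            h_a, _ = a_layer(h, z_a, z_a)  # Z_{A->L}^{[i]}
            h_v, _ = v_layer(h, z_v, z_v)  # Z_{V->L}^{[i]}
            h = h_a + h_v                  # fuse: Z_L^{[i]}
        return h

# Usage with shapes (seq_len, batch, dim).
stack = FusedCrossmodalStack(dim=32, num_heads=4, num_layers=3)
z_l = torch.randn(10, 2, 32)  # language
z_a = torch.randn(40, 2, 32)  # audio
z_v = torch.randn(25, 2, 32)  # vision
fused = stack(z_l, z_a, z_v)
```

Summation is just the example fusion from the comment above; concatenation followed by a projection, or a gated combination, would fit the same loop.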