A good question. The primary consideration for concatenating the visual and ID values ("V"s) is efficiency.
Since the attention map is shared between the visual and ID branches, we have Softmax(visual_query * visual_key^T) * Concat(visual_value, ID_value) = Concat(Softmax(visual_query * visual_key^T) * visual_value, Softmax(visual_query * visual_key^T) * ID_value), so the two forms produce the same result. The concatenated (former) implementation is simply more memory-efficient.
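For concreteness, here is a minimal sketch of that identity (tensor names and sizes are my own for illustration, not taken from the repo):

```python
import torch

# Toy sizes: L tokens, C channels per branch (hypothetical values).
L, C = 5, 8
visual_query = torch.randn(L, C)
visual_key = torch.randn(L, C)
visual_value = torch.randn(L, C)
ID_value = torch.randn(L, C)

# Shared attention map, computed once from the visual branch.
attn = torch.softmax(visual_query @ visual_key.T, dim=-1)

# One matmul against the concatenated values ...
joint = attn @ torch.cat([visual_value, ID_value], dim=-1)

# ... equals two separate matmuls whose outputs are concatenated afterwards.
separate = torch.cat([attn @ visual_value, attn @ ID_value], dim=-1)

print(torch.allclose(joint, separate, atol=1e-6))  # True
```

The equality is just block-matrix multiplication, so computing both branches jointly changes only the memory layout and the number of kernel launches, not the result.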
Okay, but in the end isn't the output a mix of visual and ID information because of the DW_CONV and the fully connected (projection) layer?
Right, they will be mixed. I also tried to project visual and ID separately, which is more complex, but no performance gain was observed.
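As a rough illustration of why the projection mixes the two branches, here is a simplified sketch (not the actual GatedPropagationModule; layer shapes and names are hypothetical): a depth-wise convolution keeps channels independent, but the following 1x1 / fully connected projection mixes visual and ID channels when they are concatenated, whereas separate per-branch projections never mix them.

```python
import torch
import torch.nn as nn

C = 8  # channels per branch (hypothetical)
x_visual = torch.randn(1, C, 16, 16)
x_id = torch.randn(1, C, 16, 16)

# Joint projection of the concatenated output: the depth-wise conv is
# per-channel, but the 1x1 projection mixes visual and ID channels.
x = torch.cat([x_visual, x_id], dim=1)                       # (1, 2C, H, W)
dw_conv = nn.Conv2d(2 * C, 2 * C, 3, padding=1, groups=2 * C)
proj = nn.Conv2d(2 * C, 2 * C, 1)                            # channel-mixing
joint_out = proj(dw_conv(x))

# Separate projections (the alternative mentioned above): each branch has
# its own conv + projection, so visual and ID stay independent.
proj_v = nn.Sequential(nn.Conv2d(C, C, 3, padding=1, groups=C), nn.Conv2d(C, C, 1))
proj_i = nn.Sequential(nn.Conv2d(C, C, 3, padding=1, groups=C), nn.Conv2d(C, C, 1))
separate_out = torch.cat([proj_v(x_visual), proj_i(x_id)], dim=1)
```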
Reading the code, I can't understand how to identify the Visual Branch and the ID Branch in the GatedPropagationModule.
https://github.com/yoxu515/aot-benchmark/blob/315c62f0fc0eaaa6c26ede48f32f2ff41e54209a/networks/layers/transformer.py#L645
Based on the implementation above, you stack the Visual and ID information, and then the output of "long_term_attention" and "short_term_attention" is split into what seem to be Visual and ID parts. But why does splitting the output of the GP function give us separate Visual and ID information if they were computed together in those modules?