yoxu515 / aot-benchmark

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch

Gated Propagation Module - ID Branch and LT/ST propagation #58

Closed AntonioConsiglio closed 1 year ago

AntonioConsiglio commented 1 year ago

Reading the code, I can't figure out how to identify the Visual Branch and the ID Branch in the GatedPropagationModule.


https://github.com/yoxu515/aot-benchmark/blob/315c62f0fc0eaaa6c26ede48f32f2ff41e54209a/networks/layers/transformer.py#L645

Based on the implementation above, you stack the visual and ID information, and then the output of "long_term_attention" and "short_term_attention" is split into what seem to be visual and ID parts. But why does splitting the output of the GP function give us separate visual and ID information if they were computed together in those modules?

z-x-yang commented 1 year ago

A good question. The primary consideration for concatenating visual and ID "V"s is "efficiency".

Since the attention map is shared between the visual and ID branches, we have Softmax(visual_query * visual_key) * Concat(visual_value, ID_value) = Concat(Softmax(visual_query * visual_key) * visual_value, Softmax(visual_query * visual_key) * ID_value). Both forms give the same result, but the former implementation is more memory-efficient.
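
Here is a minimal sketch of that equivalence; tensor shapes, the scaling factor, and variable names are illustrative assumptions, not taken from the repo:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (assumptions): L tokens, per-branch value dim d.
L, d = 16, 8
Q = torch.randn(L, d)       # visual query
K = torch.randn(L, d)       # visual key
V_vis = torch.randn(L, d)   # visual value
V_id = torch.randn(L, d)    # ID value

# Shared attention map computed once from the visual branch.
attn = F.softmax(Q @ K.t() / d ** 0.5, dim=-1)

# Option A (as described above): attend over the concatenated values, then split.
out_vis_a, out_id_a = (attn @ torch.cat([V_vis, V_id], dim=-1)).split(d, dim=-1)

# Option B: attend over each value tensor separately.
out_vis_b, out_id_b = attn @ V_vis, attn @ V_id

# Block-matrix multiplication makes the two options numerically identical.
assert torch.allclose(out_vis_a, out_vis_b, atol=1e-6)
assert torch.allclose(out_id_a, out_id_b, atol=1e-6)
```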

AntonioConsiglio commented 1 year ago

Okay, but in the end isn't the output a mix of visual and ID anyway, because of the depth-wise convolution (DW_CONV) and the fully connected projection layer?

https://github.com/yoxu515/aot-benchmark/blob/315c62f0fc0eaaa6c26ede48f32f2ff41e54209a/networks/layers/attention.py#L701

https://github.com/yoxu515/aot-benchmark/blob/315c62f0fc0eaaa6c26ede48f32f2ff41e54209a/networks/layers/attention.py#L702

z-x-yang commented 1 year ago

Right, they will be mixed. I also tried to project visual and ID separately, which is more complex, but no performance gain was observed.
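
For illustration only, a hedged sketch of the two projection variants being discussed (joint vs. separate); the layer sizes and names below are assumptions and do not mirror the repo's code:

```python
import torch
import torch.nn as nn

d = 8                        # illustrative per-branch channel dim
x_vis = torch.randn(4, d)    # attended visual features
x_id = torch.randn(4, d)     # attended ID features

# Joint projection: one linear layer over the concatenated features,
# so visual and ID channels get mixed (as discussed above).
proj_joint = nn.Linear(2 * d, 2 * d)
out_joint = proj_joint(torch.cat([x_vis, x_id], dim=-1))

# Separate projections: one linear layer per branch, no cross-branch mixing.
# Reportedly more complex with no observed performance gain.
proj_vis, proj_id = nn.Linear(d, d), nn.Linear(d, d)
out_sep = torch.cat([proj_vis(x_vis), proj_id(x_id)], dim=-1)
```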