Question about "Multi-modal Prediction Model as the Oracle"

wz7in / CVPR2023-VLSAT

CVPR2023 : VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud

Question about "Multi-modal Prediction Model as the Oracle" #12

Closed Haerxu closed 4 months ago

Haerxu commented 1 year ago

Hi,

Thanks for releasing the code! I have a question about the section "Multi-modal Prediction Model as the Oracle". It says: "To be specific, these collaboration operations are implemented by multi-head cross-attention (MHCA) modules [33], where the keys and values are node/edge features from the 3D model, and the queries are their counterparts from the multi-modal model."

Why does the 3D model provide the keys and values? I thought it should provide the queries, since the 3D model is weaker than the oracle model.

wz7in commented 4 months ago

To build a comprehensive oracle network, we need to supplement it with information from the 3D network. We therefore feed the 3D network’s features into the oracle network as keys and values. If the 3D network supplied the queries instead, the oracle would essentially be the 3D network. Our goal is to avoid adding extra information to the 3D network.
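
For readers following along, here is a minimal sketch of the direction of information flow described above. It is not the repository's code: it uses a single head with no learned projections (the paper uses multi-head cross-attention), the toy features and the names `point_feats`, `oracle_feats`, and `cross_attention` are hypothetical, and the residual update at the end is an assumption about how the attended features would be merged.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product cross-attention, no learned projections.
    # Each output row is a weighted average of the value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy node features: three nodes, two channels each.
point_feats = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.4]]   # 3D branch: keys and values
oracle_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # multi-modal branch: queries

# The oracle queries the 3D features; one output row per oracle node.
attended = cross_attention(oracle_feats, point_feats, point_feats)

# Assumed residual update: only the oracle branch changes.
oracle_updated = [[o + a for o, a in zip(o_row, a_row)]
                  for o_row, a_row in zip(oracle_feats, attended)]
```

In this sketch the output has one row per query, so the branch that supplies the queries is the one that receives information. With the oracle as the query, `point_feats` is only read and the 3D branch receives nothing extra.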