Closed: sklee2014 closed this issue 3 years ago
Hi, thanks for your interest in our work.
Our experiments use cross-modal interactions to reduce computation and increase interpretability (e.g., via the visualizations). If the focus is on improving performance, I guess adding interactions after obtaining the feature sequences of each modality would be more effective.
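To make the suggestion concrete, here is a minimal sketch of what "adding interactions after obtaining the feature sequences of each modality" could look like: one modality's features act as queries attending over another modality's feature sequence. This is a generic cross-modal attention illustration, not the paper's actual implementation; all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, key_value_seq):
    """One modality's feature sequence (queries) attends over another
    modality's feature sequence (keys/values).
    Shapes: query_seq (T_q, d), key_value_seq (T_kv, d)."""
    d = query_seq.shape[-1]
    scores = query_seq @ key_value_seq.T / np.sqrt(d)  # (T_q, T_kv)
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return attn @ key_value_seq                        # (T_q, d)

# Hypothetical per-modality feature sequences: 4 text tokens,
# 6 audio frames, feature dim 8.
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((4, 8))
audio_feats = rng.standard_normal((6, 8))

# Text features enriched with audio context.
fused = cross_modal_attention(text_feats, audio_feats)
print(fused.shape)  # (4, 8)
```

In practice such a block would sit between the unimodal encoders and the final classifier, so both modalities interact before any pooling or fusion.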
Hi, thank you for the great work! I have a question about the model names in the paper versus your implementation: do MME2E and MME2E_Sparse in the code correspond to FE2E and MESM, respectively? If so, I also wonder why FE2E works better than MESM (Tables 3 and 4 in the paper) despite having less cross-modal interaction, since MME2E has no cross-modal operation other than the multimodal fusion in the final layer. Is it perhaps because of the FLOPs? Thank you!
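For contrast with the cross-modal case, the "fusion only in the final layer" design mentioned above can be sketched as simple late fusion: each modality's feature sequence is pooled independently and the pooled vectors are concatenated before the classifier. This is an illustrative sketch with hypothetical names and shapes, not the exact MME2E code.

```python
import numpy as np

def late_fusion(modality_feats):
    """Pool each modality's feature sequence independently, then
    concatenate -- the only point where modalities interact in a
    late-fusion design (a sketch, not the repository's exact model)."""
    pooled = [feats.mean(axis=0) for feats in modality_feats]  # (d,) each
    return np.concatenate(pooled)                              # (sum of d's,)

# Hypothetical feature sequences for three modalities, feature dim 8.
rng = np.random.default_rng(1)
text = rng.standard_normal((4, 8))
audio = rng.standard_normal((6, 8))
video = rng.standard_normal((5, 8))

joint = late_fusion([text, audio, video])
print(joint.shape)  # (24,)
```

Because no modality sees another before this final concatenation, the per-modality encoders can be computed independently, which keeps FLOPs low relative to designs with repeated cross-modal attention.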