Thanks! I have looked into and tried to fix issues 3, 4, and 6. I am looking into the rest of the issues to see whether the current code makes sense, and I will let you know when I finish. Let me know if you have any questions.
I have a question about the newly updated pretraining version. It uses the checkpoint of the vit_mae_base model, which contains pos_emb and patch_embed modules. However, since the TVLT model names these modules pos_emb_v, pos_emb_a, patch_embed_v, and patch_embed_a, it seems that TVLT does not inherit those pretrained weights at the beginning of the pretraining procedure. May I ask why you did not inherit those pretrained weights? Thanks in advance.
pos_emb should match pos_emb_v, since they are both vision positional embeddings. However, the MAE pos_emb is a non-trainable sin-cos sequence. We instead initialize a trainable positional embedding, since we believe it will work better in the multi-modal setting.
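In case it helps, here is a minimal sketch of that loading logic in PyTorch. It is not the actual TVLT code: the checkpoint layout (a `"model"` dict), the filename, and the `TinyEncoder` stand-in are assumptions; only the key names `pos_emb` / `pos_emb_v` follow the discussion above.

```python
import torch
import torch.nn as nn

# Minimal sketch, not the actual TVLT loading code. The "model" key and
# the checkpoint filename are assumptions about the MAE checkpoint layout.
ckpt = torch.load("vit_mae_base.pth", map_location="cpu")["model"]

# Drop the fixed sin-cos position embedding so it is not inherited.
ckpt = {k: v for k, v in ckpt.items() if not k.startswith("pos_emb")}

class TinyEncoder(nn.Module):
    """Stand-in for TVLT: the vision pos embedding is a trainable
    parameter initialized from scratch instead of copied from MAE."""
    def __init__(self, num_patches=196, dim=768):
        super().__init__()
        self.pos_emb_v = nn.Parameter(torch.zeros(1, num_patches, dim))
        nn.init.trunc_normal_(self.pos_emb_v, std=0.02)

model = TinyEncoder()
# strict=False: the matching MAE weights load, while newly named modules
# (pos_emb_v, the audio branch, ...) keep their fresh initialization.
missing, unexpected = model.load_state_dict(ckpt, strict=False)
```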
I see. Thanks for the quick answer. Then what about patch_embed_v? Its weights seem important to me, since it is the first module that processes the input data.
Since we treat image and audio as homogeneous inputs, our normalized image/audio inputs differ slightly from the original MAE image normalization. We therefore re-train patch_embed_v to adapt to it.
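The corresponding sketch for the patch projections (assumed hidden size, patch size, and channel counts; not the actual TVLT modules): both projections are created fresh and re-trained rather than copied from the MAE patch_embed.

```python
import torch.nn as nn

# Assumed shapes: 768-dim hidden size and 16x16 patches for both streams.
patch_embed_v = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # RGB frames
patch_embed_a = nn.Conv2d(1, 768, kernel_size=16, stride=16)  # spectrogram

for m in (patch_embed_v, patch_embed_a):
    nn.init.xavier_uniform_(m.weight)  # fresh init, adapted during pretraining
    nn.init.zeros_(m.bias)
```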
Now I understand. Thank you! May I ask whether the code is finalized? I see that the retrieval evaluation has been modified.
I modified the retrieval evaluation to match the original implementation as closely as possible. Let me know if you have any other concerns.
Hi, thank you for your great work. I have a few questions about your implementation.
Thanks in advance.