zinengtang / TVLT

PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
MIT License

Downstream task Cosine scheduler #4

Closed G-JWLee closed 1 year ago

G-JWLee commented 1 year ago

Hi, thank you for your great work. I have a few questions about your implementation.

  1. In your code, there seems to be no cosine scheduler, but the paper says a cosine scheduler was used during finetuning. Which one is correct? Also, did you mean CosineAnnealingScheduler, or just a cosine schedule that slowly decreases the LR? (A sketch of what I mean follows this list.)
  2. Also, warmup_steps=1000 seems to have no effect. Could you correct this?
  3. In the paper, you say you used 6k MSR-VTT train samples, but the code uses 9k train samples by referencing the 'train_list_jsfusion.txt' file in MSR-VTT. Did you manually exclude the extra samples?
  4. The function compute_vrar_recall() in objectives.py uses a DistributedSampler, so with 4 GPUs only 250 of the 1000 results are stored in each rank's rank_scores list, which causes an error (see the second sketch after this list). Could you fix it?
  5. Also, is the audio-video retrieval task reproducible with your code's settings? How many epochs did you run? I keep failing to reproduce audio-video retrieval on the MSR-VTT dataset.
  6. All finetuning and pretraining configs refer to the pretrained weight load_hub_path='TVLT.ckpt'. The paper says the model is initialized from ImageNet-pretrained weights before pretraining on HowTo100M. Could you fix this?
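
For questions 1 and 2, this is roughly the schedule I have in mind (a minimal sketch only, not the repo's code; `warmup_steps` and `total_steps` are placeholders):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup for `warmup_steps`, then cosine decay of the LR to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)
```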
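
For question 4, one possible way to gather the per-rank retrieval scores before computing recall could look like this (a rough sketch; the function and variable names are mine, not from objectives.py):

```python
import torch
import torch.distributed as dist

def gather_retrieval_scores(local_scores, local_ids):
    """Collect every rank's score rows and reorder them by dataset index."""
    world_size = dist.get_world_size()
    gathered_scores = [None] * world_size
    gathered_ids = [None] * world_size
    # all_gather_object accepts arbitrary picklable objects (CPU tensors here).
    dist.all_gather_object(gathered_scores, local_scores.cpu())
    dist.all_gather_object(gathered_ids, list(local_ids))
    # Reassemble the full score matrix in the original dataset order.
    all_scores = torch.cat(gathered_scores, dim=0)
    all_ids = torch.tensor(sum(gathered_ids, []))
    order = all_ids.argsort()
    return all_scores[order]
```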

Thanks in advance.

zinengtang commented 1 year ago

Thanks! I have looked into and tried to fix issues 3, 4, and 6. I am looking into the remaining issues to see whether the current code makes sense, and will let you know when I finish. Let me know if you have any questions.

G-JWLee commented 1 year ago

I have a question about the newly updated pretraining version. It uses a checkpoint of the vit_mae_base model, which contains the pos_emb and patch_embed modules. However, since the TVLT model names its modules pos_emb_v, pos_emb_a, patch_embed_v, and patch_embed_a, it seems that TVLT does not inherit those pretrained weights at the beginning of pretraining. Can I ask why you did not inherit them? A rough sketch of what I mean is below. Thanks in advance.
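
For concreteness, this is roughly the kind of weight inheritance I had in mind (a rough sketch only; the key names `patch_embed.*` and `pos_embed` are my assumption about the MAE checkpoint layout, not code from this repo):

```python
import torch

def load_mae_vision_weights(model, mae_ckpt_path):
    """Copy vit_mae_base vision weights into TVLT's vision-side modules."""
    mae_state = torch.load(mae_ckpt_path, map_location="cpu")
    tvlt_state = model.state_dict()
    remapped = {}
    for key, value in mae_state.items():
        if key.startswith("patch_embed."):
            remapped["patch_embed_v." + key[len("patch_embed."):]] = value
        elif key == "pos_embed":
            remapped["pos_emb_v"] = value
    # Only copy tensors whose names and shapes match the TVLT modules.
    compatible = {k: v for k, v in remapped.items()
                  if k in tvlt_state and tvlt_state[k].shape == v.shape}
    tvlt_state.update(compatible)
    model.load_state_dict(tvlt_state)
```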

zinengtang commented 1 year ago

pos_emb should match pos_emb_v since they are both vision positional embeddings. However, the MAE one is a non-trainable sin-cos sequence. We instead initialize a trainable positional embedding, since we believe it works better in multi-modal settings.
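
Roughly, the difference is something like this (a minimal illustrative sketch; the module and argument names are not the exact repo code):

```python
import torch
import torch.nn as nn

class PosEmbedding(nn.Module):
    """Trainable positional embedding vs. a fixed (MAE-style) sin-cos table."""
    def __init__(self, num_patches, dim, trainable=True):
        super().__init__()
        if trainable:
            # Learn the table from scratch.
            self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))
            nn.init.trunc_normal_(self.pos_emb, std=0.02)
        else:
            # MAE-style: a precomputed sin-cos table kept as a non-trainable buffer
            # (zeros here stand in for the actual sin-cos values).
            self.register_buffer("pos_emb", torch.zeros(1, num_patches, dim))

    def forward(self, x):
        return x + self.pos_emb
```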

G-JWLee commented 1 year ago

I see. Thanks for the quick answer. Then what about patch_embed_v? Its weights seem important to me, since it is the first module that touches the input data.

zinengtang commented 1 year ago

Since we treat image and audio as homogeneous inputs, the normalized image/audio inputs differ slightly from the original MAE image normalization. We re-train patch_embed_v to adapt to this.

G-JWLee commented 1 year ago

Now I understand, thank you! Could I ask whether the code is finalized? I see the retrieval evaluation has been modified.

zinengtang commented 1 year ago

I modified the retrieval eval to match the implementation as closely as possible. Let me know if you have any other concerns.