zinengtang / TVLT

PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
MIT License

Finetuning on MOSEI but with nan output #8

Closed: BDHU closed this issue 1 year ago

BDHU commented 1 year ago

Hey Zineng, thanks for the amazing work! I tried the MOSEI finetuning script in the repo and I downloaded the MOSEI dataset per your instruction here.

The script I'm using is:

python run.py with data_root='./dataset/cmumosei/' gpus=[1] num_nodes=1 task_cls_mosei \
per_gpu_batchsize=1 num_workers=16 val_check_interval=0.2 warmup_steps=100 max_epoch=10 \
load_hub_path='TVLT.ckpt'

After I launch the script, I noticed at epoch 0, step 256, the output all becomes nan for some reason. I printed out the result at https://github.com/zinengtang/TVLT/tree/main/model/modules#L165. After a number of iterations, the output becomes something like:

{'mosei_loss': tensor(nan, device='cuda:1', grad_fn=<MseLossBackward0>), 'mosei_score': tensor([[nan]], device='cuda:1', grad_fn=<AddmmBackward0>), 'mosei_labels2': tensor([1], device='cuda:1')}

This results in all of the model parameters turning into nan. Are there any steps I'm missing? Any tips would be helpful! Thanks :)
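For anyone hitting the same issue, a minimal sketch of how one might localize where the nans first appear (the helper `has_nan` is a hypothetical name, not part of the TVLT codebase):

```python
import torch

# Make autograd raise at the exact backward op that first produces nan
# (slows training; intended for debugging only).
torch.autograd.set_detect_anomaly(True)

def has_nan(model: torch.nn.Module) -> bool:
    """Check all parameters for nan, e.g. after each optimizer step."""
    return any(torch.isnan(p).any() for p in model.parameters())
```

Calling `has_nan(model)` every N steps narrows down the first step at which the weights are corrupted.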

Yimi81 commented 1 year ago

the same question

zinengtang commented 1 year ago

What (audio or video) sequence length did you use? Usually, the longer the sequence, the more unstable training becomes. The attention mask operation here can be unstable, especially in settings like mixed precision: `attn = attn.masked_fill(~mask[:, None, None, :].bool(), float('-inf'))`. You can change `float('-inf')` to `-1e8`.
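A minimal sketch of the suggested change, with made-up shapes (batch 2, 4 heads, sequence length 8) standing in for the real attention module:

```python
import torch

B, H, L = 2, 4, 8  # hypothetical batch, heads, sequence length
attn = torch.randn(B, H, L, L)                 # raw attention scores
mask = torch.ones(B, L, dtype=torch.bool)
mask[:, 6:] = False                            # pretend the last two positions are padding

# Original (can yield nan, e.g. when a softmax row is entirely -inf):
# attn = attn.masked_fill(~mask[:, None, None, :], float('-inf'))

# Suggested change: a large finite negative value instead of -inf.
attn = attn.masked_fill(~mask[:, None, None, :], -1e8)
probs = attn.softmax(dim=-1)                   # masked positions get ~0 probability
```

After the change, masked positions still receive effectively zero attention weight, but the softmax never sees `-inf`, so its output stays finite.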

BDHU commented 1 year ago

Thank you for the response. Also, a side question: is there any reason to choose batch size 1 (`per_gpu_batchsize=1`) in `finetune_mosei.sh`?

zinengtang commented 1 year ago

`per_gpu_batchsize=1` is the largest batch size that fits in GPU memory, and you can change it. `batch_size` is the total effective batch size, which combines gradient accumulation with `per_gpu_batchsize * gpu_number`.
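In other words, the trainer accumulates gradients until the effective batch matches `batch_size`. A sketch of that arithmetic, with hypothetical values (not from the actual config):

```python
# Hypothetical values for illustration only.
per_gpu_batchsize = 1   # samples per GPU per forward pass
gpu_number = 4          # GPUs across all nodes
batch_size = 8          # desired total effective batch size

# Gradient accumulation steps needed so the effective batch matches batch_size:
grad_accum_steps = batch_size // (per_gpu_batchsize * gpu_number)
```

So with these numbers, gradients from 2 forward passes on each of the 4 GPUs are accumulated before one optimizer step.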