Closed: BDHU closed this issue 1 year ago
I have the same question.
What (audio or video) sequence length did you use? Usually, the longer the sequence, the more unstable training becomes.
The attention mask operation here can be unstable, especially in some settings like mixed precision:
attn = attn.masked_fill(~mask[:, None, None, :].bool(), float('-inf'))
You can change float('-inf') to -1e8.
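For intuition, here is a pure-Python sketch of one way `float('-inf')` in the mask can poison the softmax inside attention (this is a stand-in for illustration, not the repo's actual attention code): if an entire row of scores is masked, the usual max-subtraction step computes `-inf - (-inf)`, which is `nan`, and the `nan` then propagates through normalization. A large finite fill value like `-1e8` keeps every term well defined.

```python
import math

def softmax(row):
    """Numerically-stabilized softmax over a list of floats."""
    m = max(row)                               # -inf when the whole row is masked
    exps = [math.exp(x - m) for x in row]      # -inf - (-inf) = nan -> exp(nan) = nan
    total = sum(exps)
    return [e / total for e in exps]

# A fully masked row (e.g. an all-padding sequence) with -inf fill:
fully_masked_inf = softmax([float('-inf')] * 4)   # every entry is nan

# The same row with a large finite fill value instead:
fully_masked_finite = softmax([-1e8] * 4)         # uniform 0.25 each, no nan
```

Mixed precision makes this worse in practice, since fp16 overflows to `inf` far sooner than fp32, but the fully-masked-row case above already fails even in full precision.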
Thank you for the response. Also, a side question: is there any reason to choose batch size 1 (per_gpu_batchsize=1) in finetune_mosei.sh?
per_gpu_batchsize=1 is the optimal batch size to fit in memory, which you can change. batch_size is the total batch size, which combines gradient accumulation with per_gpu_batchsize * gpu_number.
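To make that relationship concrete, here is a small arithmetic sketch. The numbers are hypothetical for illustration, not the script's actual defaults:

```python
# Hypothetical values -- substitute the ones from your own run.
per_gpu_batchsize = 1    # samples processed per GPU per forward pass
gpu_number = 4           # GPUs participating in training
batch_size = 32          # total/effective batch size set in the config

# The trainer derives gradient-accumulation steps so that:
#   batch_size == per_gpu_batchsize * gpu_number * grad_accum_steps
grad_accum_steps = batch_size // (per_gpu_batchsize * gpu_number)
print(grad_accum_steps)  # 8 accumulation steps with these numbers
```

So raising per_gpu_batchsize (if memory allows) simply reduces the number of accumulation steps; the effective batch size, and hence the optimization behavior, stays the same.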
Hey Zineng, thanks for the amazing work! I tried the MOSEI finetuning script in the repo and I downloaded the MOSEI dataset per your instruction here.
The script I'm using is:
After I launch the script, I noticed that at epoch 0, step 256, the outputs all become `nan` for some reason. I printed out the result at https://github.com/zinengtang/TVLT/tree/main/model/modules#L165, and after a number of iterations the output there becomes `nan` as well. This has resulted in all of the model parameters turning into `nan`. Are there any steps I'm missing? Any tips would be helpful! Thanks :)