My package versions are: deepspeed==0.3.1, triton==0.2.3, and CUDA 10.1 with PyTorch 1.6, though CUDA 10.2 with PyTorch 1.7 should also work.
You can also try using gradient clipping if you haven't already: --gradient_clip_val 1
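For reference, --gradient_clip_val maps to PyTorch Lightning's Trainer(gradient_clip_val=...), which rescales the global gradient norm each optimizer step. A minimal sketch of the equivalent in plain PyTorch (the model and shapes below are placeholders, not the actual VideoGPT setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the GPT; illustrative only
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()

# Equivalent of Trainer(gradient_clip_val=1.0):
# rescale gradients so their global norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```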
Hi @wilson1yan, thanks for the info. I was able to properly train the models with deepspeed==0.3.1 and triton==0.2.3.
Thanks for the repo and great work!
I recently found that DeepSpeed has a numeric issue when using fp32: a recent update of triton causes imprecise forward and backward results for sparse operations (https://github.com/microsoft/DeepSpeed/issues/1222).

I am also unable to get the GPT part to train properly with sparse attention on BAIR and UCF-101, using either fp16 or fp32. On BAIR, the gradients exploded after a few steps; on UCF-101, the loss didn't decrease. I'm not sure whether this is due to deepspeed or some other bug. As a sanity check, could you please let me know which versions of these packages you used, specifically deepspeed, triton, and your CUDA version? Thanks!
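For what it's worth, here is a minimal sketch of the kind of forward/backward precision check behind the linked DeepSpeed issue: run the same op in fp32 and compare it against an fp64 reference. Plain dense attention stands in for the sparse kernel here, and the function, shapes, and tolerances are all illustrative assumptions:

```python
import torch

def attention(q, k, v, mask):
    # plain dense scaled dot-product attention;
    # swap in the sparse kernel under test here
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
B, H, T, D = 1, 2, 64, 32  # illustrative shapes
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # causal mask

q32, k32, v32 = (torch.randn(B, H, T, D, requires_grad=True) for _ in range(3))
# same inputs promoted to fp64 as a reference
q64, k64, v64 = (t.detach().double().requires_grad_(True) for t in (q32, k32, v32))

out32 = attention(q32, k32, v32, mask)
out64 = attention(q64, k64, v64, mask)
print("forward max abs err:", (out32.double() - out64).abs().max().item())

out32.sum().backward()
out64.sum().backward()
print("backward max abs err:", (q32.grad.double() - q64.grad).abs().max().item())
```

A healthy fp32 kernel should stay within roughly 1e-5 of the fp64 reference here; errors orders of magnitude larger would point to the kind of imprecision reported in the issue.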