My package versions are: deepspeed==0.3.1, triton==0.2.3, and CUDA 10.1 with PyTorch 1.6, though CUDA 10.2 with PyTorch 1.7 should also work.
You can also try using gradient clipping if you haven't already: --gradient_clip_val 1
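For reference, --gradient_clip_val maps to PyTorch Lightning's Trainer(gradient_clip_val=...), which rescales the global gradient norm each optimizer step. A minimal sketch of the equivalent in plain PyTorch (the model and shapes below are placeholders, not the actual VideoGPT setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the GPT; illustrative only
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()

# Equivalent of Trainer(gradient_clip_val=1.0):
# rescale gradients so their global norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```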
Hi @wilson1yan, thanks for the info. I was able to properly train the models with deepspeed==0.3.1 and triton==0.2.3.
Thanks for the repo and great work!
I recently found that DeepSpeed has a numeric issue when using fp32: a recent update of triton causes imprecise forward and backward results for sparse operations (https://github.com/microsoft/DeepSpeed/issues/1222).

I am also unable to get the GPT part to train properly with sparse attention on BAIR and UCF-101, using either fp16 or fp32. On BAIR, the gradients exploded after a few steps; on UCF-101, the loss didn't decrease. I'm not sure whether this is due to deepspeed or some other bug. As a sanity check, could you please let me know which versions of these packages you used, specifically deepspeed, triton, and your CUDA version? Thanks!
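For what it's worth, here is a minimal sketch of the kind of forward/backward precision check behind the linked DeepSpeed issue: run the same op in fp32 and compare it against an fp64 reference. Plain dense attention stands in for the sparse kernel here, and the function, shapes, and tolerances are all illustrative assumptions:

```python
import torch

def attention(q, k, v, mask):
    # plain dense scaled dot-product attention;
    # swap in the sparse kernel under test here
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
B, H, T, D = 1, 2, 64, 32  # illustrative shapes
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # causal mask

q32, k32, v32 = (torch.randn(B, H, T, D, requires_grad=True) for _ in range(3))
# same inputs promoted to fp64 as a reference
q64, k64, v64 = (t.detach().double().requires_grad_(True) for t in (q32, k32, v32))

out32 = attention(q32, k32, v32, mask)
out64 = attention(q64, k64, v64, mask)
print("forward max abs err:", (out32.double() - out64).abs().max().item())

out32.sum().backward()
out64.sum().backward()
print("backward max abs err:", (q32.grad.double() - q64.grad).abs().max().item())
```

A healthy fp32 kernel should stay within roughly 1e-5 of the fp64 reference here; errors orders of magnitude larger would point to the kind of imprecision reported in the issue.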