xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers with Mixture of Experts *(still work in progress)*
MIT License

End-to-end FP8 training #45

Status: Open · xrsrke opened this issue 9 months ago

xrsrke commented 9 months ago

Notes

TODO
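For context on the issue title: end-to-end FP8 training typically keeps weights and activations in the E4M3 format during forward passes. The issue body does not specify the intended implementation, so the following is only a minimal sketch of E4M3 rounding, using the format parameters from the OCP FP8 specification (max value 448, 3 mantissa bits, smallest normal exponent −6) rather than anything from this repo:

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite E4M3 value (OCP FP8 spec)
MANTISSA_BITS = 3      # E4M3 carries a 3-bit mantissa
MIN_NORMAL_EXP = -6    # exponent of the smallest normal E4M3 number

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (pure-Python simulation)."""
    if x == 0.0 or math.isnan(x):
        return x
    # Saturate instead of overflowing, as common FP8 training recipes do.
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    e = math.floor(math.log2(abs(x)))
    e = max(e, MIN_NORMAL_EXP)          # flush tiny values into the subnormal range
    step = 2.0 ** (e - MANTISSA_BITS)   # spacing between representable values here
    return round(x / step) * step
```

In practice, FP8 training frameworks pair this rounding with a per-tensor scaling factor so that tensor values land inside E4M3's narrow dynamic range before quantization.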

3outeille commented 9 months ago

@xrsrke On it