xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)*

ZeRO-1 #20

Closed: xrsrke closed this issue 9 months ago

xrsrke commented 10 months ago

Scope: partition only the optimizer states (ZeRO stage 1) across the data-parallel ranks, and make it work together with 3D parallelism (tensor, pipeline, and data parallelism). See the sketch below.
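
For context, here is a minimal sketch of what ZeRO-1-style optimizer-state partitioning looks like on top of plain data parallelism. This is not pipegoose's actual API; the `shard_params` helper and the round-robin ownership scheme are assumptions for illustration only:

```python
# Minimal ZeRO-1 sketch (hypothetical, not pipegoose's implementation).
# Each data-parallel rank keeps a full replica of the model for the
# forward/backward pass, but builds its optimizer over only the slice of
# parameters it "owns", so optimizer states are partitioned across ranks.
# Launch with e.g.: torchrun --nproc_per_node=2 zero1_sketch.py

import torch
import torch.distributed as dist
from torch import nn


def shard_params(params, rank, world_size):
    # Hypothetical helper: round-robin assignment of parameters to ranks.
    return [p for i, p in enumerate(params) if i % world_size == rank]


def main():
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)  # identical initial replicas on every rank
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

    # ZeRO-1 core idea: each rank allocates optimizer states (Adam's
    # moments) only for the parameter shard it owns.
    owned = shard_params(list(model.parameters()), rank, world_size)
    optimizer = torch.optim.Adam(owned, lr=1e-3)

    x, y = torch.randn(4, 8), torch.randn(4, 1)  # per-rank microbatch
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Gradients are still averaged across the whole data-parallel group,
    # exactly as in vanilla data parallelism.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    optimizer.step()  # updates only the locally owned shard

    # Each rank then broadcasts its updated shard so every replica of the
    # parameters stays in sync for the next forward pass.
    for i, p in enumerate(model.parameters()):
        dist.broadcast(p.data, src=i % world_size)

    model.zero_grad(set_to_none=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The payoff of stage 1 is that Adam's two moment tensors, which together take twice the memory of the parameters themselves, each live on only one rank, while parameters and gradients stay fully replicated; ZeRO-2 and ZeRO-3 would go further and partition those as well.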