xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers with Mixture of Experts *(still a work in progress)*
MIT License

DiLoCo replication (DiLoCo: Distributed Low-Communication Training of Language Models) #59

**Open** · xrsrke opened this issue 9 months ago
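For context on what a replication would involve: DiLoCo runs an inner AdamW optimizer locally on each worker for a number of steps, then treats the averaged parameter delta as an "outer gradient" and applies an outer Nesterov-momentum SGD step (lr 0.7, momentum 0.9 in the paper), so workers only communicate once per outer round. Below is a minimal single-process sketch of that outer loop under those assumptions; the worker count `K`, inner-step count `H`, and the toy model and objective are illustrative stand-ins, not pipegoose APIs.

```python
# Minimal single-process sketch of DiLoCo's outer loop (Douillard et al., 2023).
# Inner optimizer: AdamW per worker; outer optimizer: Nesterov-momentum SGD on
# the averaged parameter delta. K, H, the model, and the objective are toy
# placeholders for illustration only.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

K, H, OUTER_STEPS = 4, 20, 10  # workers, inner steps per round, outer rounds

global_model = nn.Linear(16, 1)  # stand-in for a language model

# Outer optimizer hyperparameters follow the paper (lr=0.7, Nesterov momentum 0.9).
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

for outer_step in range(OUTER_STEPS):
    # Snapshot of the global parameters at the start of the round.
    theta_start = [p.detach().clone() for p in global_model.parameters()]

    # Each worker trains an independent replica for H inner steps.
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for worker in range(K):
        replica = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(replica.parameters(), lr=1e-3)
        for _ in range(H):
            x = torch.randn(8, 16)           # toy shard of this worker's data
            loss = replica(x).pow(2).mean()  # toy objective
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Accumulate this worker's outer gradient: theta_start - theta_local.
        for d, p0, p in zip(deltas, theta_start, replica.parameters()):
            d += (p0 - p.detach()) / K

    # Apply the averaged delta as the gradient for the outer step. In a real
    # multi-node run this average is the only communication per round
    # (e.g. one all-reduce every H steps).
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```

A real replication would replace the inner worker loop with genuinely parallel processes (e.g. `torch.distributed` with an all-reduce over the deltas) and shard the training data across workers, but the update rule itself is the part shown here.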