xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)*

ZeRO-1 #20

Closed: xrsrke closed this issue 9 months ago

xrsrke commented 10 months ago

Scope: partition only the optimizer states (ZeRO stage 1) across the data-parallel ranks, and make it work together with 3D parallelism (tensor, pipeline, and data parallelism). See the sketch below.
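
For context, here is a minimal sketch of what ZeRO-1-style optimizer-state partitioning looks like on top of plain data parallelism. This is not pipegoose's actual API; the `shard_params` helper and the round-robin ownership scheme are assumptions for illustration only:

```python
# Minimal ZeRO-1 sketch (hypothetical, not pipegoose's implementation).
# Each data-parallel rank keeps a full replica of the model for the
# forward/backward pass, but builds its optimizer over only the slice of
# parameters it "owns", so optimizer states are partitioned across ranks.
# Launch with e.g.: torchrun --nproc_per_node=2 zero1_sketch.py

import torch
import torch.distributed as dist
from torch import nn


def shard_params(params, rank, world_size):
    # Hypothetical helper: round-robin assignment of parameters to ranks.
    return [p for i, p in enumerate(params) if i % world_size == rank]


def main():
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)  # identical initial replicas on every rank
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

    # ZeRO-1 core idea: each rank allocates optimizer states (Adam's
    # moments) only for the parameter shard it owns.
    owned = shard_params(list(model.parameters()), rank, world_size)
    optimizer = torch.optim.Adam(owned, lr=1e-3)

    x, y = torch.randn(4, 8), torch.randn(4, 1)  # per-rank microbatch
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Gradients are still averaged across the whole data-parallel group,
    # exactly as in vanilla data parallelism.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    optimizer.step()  # updates only the locally owned shard

    # Each rank then broadcasts its updated shard so every replica of the
    # parameters stays in sync for the next forward pass.
    for i, p in enumerate(model.parameters()):
        dist.broadcast(p.data, src=i % world_size)

    model.zero_grad(set_to_none=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The payoff of stage 1 is that Adam's two moment tensors, which together take twice the memory of the parameters themselves, each live on only one rank, while parameters and gradients stay fully replicated; ZeRO-2 and ZeRO-3 would go further and partition those as well.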