xrsrke / pipegoose

Large-scale 4D parallelism pre-training for šŸ¤— transformers with Mixture of Experts *(still a work in progress)*

Mixture of Experts #19

Open xrsrke opened 10 months ago

APIs

```python
import torch.nn as nn

from pipegoose.distributed import ParallelContext  # import path assumed
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss

# one expert-parallel group of 8 workers
parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

# user-defined expert, router, and router-noise policy (sketched below)
mlp = CustomExpert()
router = CustomRouter()
noise_policy = CustomNoisePolicy()
loss_func = nn.CrossEntropyLoss()

model = ExpertParallel(
    model,  # a pretrained šŸ¤— transformers model
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()

# wrap the task loss so the router's auxiliary
# load-balancing loss is added with weight 0.1
loss_func = ExpertLoss(loss_func, aux_weight=0.1)
```
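
For reference, a minimal sketch of what the user-supplied pieces above might look like. `CustomExpert`, `CustomRouter`, and `CustomNoisePolicy` are placeholders from the snippet; the interfaces below (a transformer MLP expert, a top-k softmax gate, and Gaussian jitter on router logits) are assumptions for illustration, not pipegoose's actual contracts, and `ExpertParallel` is assumed to wire them together internally.

```python
import torch
from torch import nn
import torch.nn.functional as F


class CustomExpert(nn.Module):
    """A single feed-forward expert: a standard transformer MLP block."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.gelu(self.w_in(x)))


class CustomRouter(nn.Module):
    """Scores each token against every expert and keeps the top-k."""

    def __init__(self, d_model: int = 768, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        probs = self.gate(x).softmax(dim=-1)    # (batch, seq, num_experts)
        weights, indices = torch.topk(probs, self.k, dim=-1)
        return weights, indices                 # per-token expert choices


class CustomNoisePolicy(nn.Module):
    """Jitters router logits during training to spread tokens across experts."""

    def __init__(self, stddev: float = 1e-2):
        super().__init__()
        self.stddev = stddev

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return logits + torch.randn_like(logits) * self.stddev
```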

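Likewise, `aux_weight=0.1` suggests `ExpertLoss` adds a weighted auxiliary load-balancing term to the task loss. Below is a sketch of one common formulation (the Switch Transformer aux loss); the function name and wiring are illustrative, not pipegoose's implementation.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor,
                        expert_indices: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Switch-style aux loss: num_experts * sum_i(f_i * P_i), where f_i is
    the fraction of tokens dispatched to expert i and P_i is the mean router
    probability assigned to expert i. Minimized by a uniform assignment.
    """
    # f_i: hard dispatch fractions from the chosen expert indices
    dispatch = F.one_hot(expert_indices, num_experts).float().mean(dim=(0, 1))
    # P_i: soft probability mass per expert from the router distribution
    prob_mass = router_probs.mean(dim=(0, 1))
    return num_experts * torch.sum(dispatch * prob_mass)
```

Under this reading, the wrapped loss would evaluate to roughly `task_loss + 0.1 * load_balancing_loss(...)`, with the router statistics collected during the forward pass.
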
TODOs

Engineering Reading

MoE Reading