xrsrke / pipegoose

Large-scale 4D parallelism pre-training for šŸ¤— transformers with Mixture of Experts *(still a work in progress)*

Mixture of Experts #19

Open xrsrke opened 10 months ago

APIs

```python
import torch.nn as nn

from pipegoose.distributed import ParallelContext  # import path assumed
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss

# one expert-parallel group of 8 workers
parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

# user-defined expert, router, and router-noise policy (sketched below)
mlp = CustomExpert()
router = CustomRouter()
noise_policy = CustomNoisePolicy()
loss_func = nn.CrossEntropyLoss()

model = ExpertParallel(
    model,  # a pretrained šŸ¤— transformers model
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()

# wrap the task loss so the router's auxiliary
# load-balancing loss is added with weight 0.1
loss_func = ExpertLoss(loss_func, aux_weight=0.1)
```
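
For reference, a minimal sketch of what the user-supplied pieces above might look like. `CustomExpert`, `CustomRouter`, and `CustomNoisePolicy` are placeholders from the snippet; the interfaces below (a transformer MLP expert, a top-k softmax gate, and Gaussian jitter on router logits) are assumptions for illustration, not pipegoose's actual contracts, and `ExpertParallel` is assumed to wire them together internally.

```python
import torch
from torch import nn
import torch.nn.functional as F


class CustomExpert(nn.Module):
    """A single feed-forward expert: a standard transformer MLP block."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.gelu(self.w_in(x)))


class CustomRouter(nn.Module):
    """Scores each token against every expert and keeps the top-k."""

    def __init__(self, d_model: int = 768, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        probs = self.gate(x).softmax(dim=-1)    # (batch, seq, num_experts)
        weights, indices = torch.topk(probs, self.k, dim=-1)
        return weights, indices                 # per-token expert choices


class CustomNoisePolicy(nn.Module):
    """Jitters router logits during training to spread tokens across experts."""

    def __init__(self, stddev: float = 1e-2):
        super().__init__()
        self.stddev = stddev

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return logits + torch.randn_like(logits) * self.stddev
```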

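Likewise, `aux_weight=0.1` suggests `ExpertLoss` adds a weighted auxiliary load-balancing term to the task loss. Below is a sketch of one common formulation (the Switch Transformer aux loss); the function name and wiring are illustrative, not pipegoose's implementation.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor,
                        expert_indices: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Switch-style aux loss: num_experts * sum_i(f_i * P_i), where f_i is
    the fraction of tokens dispatched to expert i and P_i is the mean router
    probability assigned to expert i. Minimized by a uniform assignment.
    """
    # f_i: hard dispatch fractions from the chosen expert indices
    dispatch = F.one_hot(expert_indices, num_experts).float().mean(dim=(0, 1))
    # P_i: soft probability mass per expert from the router distribution
    prob_mass = router_probs.mean(dim=(0, 1))
    return num_experts * torch.sum(dispatch * prob_mass)
```

Under this reading, the wrapped loss would evaluate to roughly `task_loss + 0.1 * load_balancing_loss(...)`, with the router statistics collected during the forward pass.
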
TODOs

Engineering Reading

MoE Reading