xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers with Mixture of Experts *(still a work in progress)*

Mixed precision training in FP16 #14

Open · xrsrke opened this issue 10 months ago

xrsrke commented 10 months ago

TODOs

APIs

import torch
import pipegoose

# other parallelism...
scaler = pipegoose.amp.GradScaler()

with pipegoose.amp.autocast(parallel_context, dtype=torch.float16):
    # forward pass runs in float16 where it is safe to do so
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss

# scale the loss so small FP16 gradients don't underflow during backward
scaled_loss = scaler.scale(loss)

optimizer.zero_grad()
scaled_loss.backward()
scaler.step(optimizer)  # unscales gradients and skips the step on overflow
scaler.update()         # updates the scale for the next iteration
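To make the TODO more concrete, below is a minimal sketch of the dynamic loss-scaling logic such a `GradScaler` typically implements (standard AMP behaviour; the class name `NaiveGradScaler` and the default hyperparameters are illustrative, not part of pipegoose):

```python
import torch


class NaiveGradScaler:
    """Minimal dynamic loss scaler (illustrative sketch, not pipegoose API).

    Multiplies the loss by a large factor so small FP16 gradients don't
    underflow, unscales the gradients before the optimizer step, skips the
    step on overflow, and adjusts the factor over time.
    """

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale_value = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0
        self._found_inf = False

    def scale(self, loss):
        # Scale the loss so backward() produces scaled gradients
        return loss * self.scale_value

    def _grads_are_finite(self, optimizer):
        for group in optimizer.param_groups:
            for p in group["params"]:
                if p.grad is not None and not torch.isfinite(p.grad).all():
                    return False
        return True

    def step(self, optimizer):
        if self._grads_are_finite(optimizer):
            # Unscale gradients in place, then take the real optimizer step
            for group in optimizer.param_groups:
                for p in group["params"]:
                    if p.grad is not None:
                        p.grad.div_(self.scale_value)
            optimizer.step()
            self._found_inf = False
        else:
            # Overflow detected: skip this optimizer step entirely
            self._found_inf = True

    def update(self):
        if self._found_inf:
            # Back off the scale after an overflow
            self.scale_value *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                # Grow the scale after a run of successful steps
                self.scale_value *= self.growth_factor
```

A parallel-aware version would additionally have to agree on the overflow decision across data/tensor/pipeline-parallel ranks (for example, by all-reducing a found-inf flag through `parallel_context`) before deciding whether to skip the step, since each rank only sees its own shard of the gradients.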

Reading List