Save and load checkpoints

Notes

Suppose we trained a model and saved a checkpoint with tensor_parallel_size=2 and pipeline_parallel_size=4. We now want to load that checkpoint and continue training under a new configuration with tensor_parallel_size=4 and pipeline_parallel_size=3.
With merge_checkpoints=True, the per-rank partitions are merged into a single checkpoint file instead of each rank saving its own partition. The merged file uses a format that both an unparallelized model and a parallelized model can load.
With save_config=True, all configuration is saved if present: tensor_parallel_size, pipeline_parallel_size, and the arguments passed to the XParallel and DistributedOptimizer classes.
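
To make the merge-then-repartition idea in the notes above concrete, here is a minimal sketch in pure Python. It is not pipegoose's implementation: the helpers `split_columns` and `merge_columns` are hypothetical, and the sketch assumes a weight that is partitioned column-wise into equal shards, one per tensor-parallel rank.

```python
from typing import List

Matrix = List[List[float]]


def split_columns(weight: Matrix, num_parts: int) -> List[Matrix]:
    """Split a weight matrix column-wise into num_parts equal shards (hypothetical helper)."""
    num_cols = len(weight[0])
    if num_cols % num_parts != 0:
        raise ValueError("column count must be divisible by num_parts")
    width = num_cols // num_parts
    return [
        [row[i * width:(i + 1) * width] for row in weight]
        for i in range(num_parts)
    ]


def merge_columns(shards: List[Matrix]) -> Matrix:
    """Concatenate column-wise shards back into the full matrix (hypothetical helper)."""
    num_rows = len(shards[0])
    return [
        [value for shard in shards for value in shard[r]]
        for r in range(num_rows)
    ]


# A toy 2 x 4 weight standing in for one parallelized layer.
full = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

old_shards = split_columns(full, 2)     # layout saved with tensor_parallel_size=2
merged = merge_columns(old_shards)      # what a merged checkpoint would hold
new_shards = split_columns(merged, 4)   # layout needed for tensor_parallel_size=4
```

Because the merged matrix is independent of the number of ranks that produced it, the same file can be loaded unpartitioned or re-split for any tensor-parallel size that divides the partitioned dimension.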

APIs

# Save checkpoints of a parallelized model
model.save_pretrained(
    save_directory="./checkpoints",
    save_config=True,  # default
    save_function=torch.save,  # default
    merge_checkpoints=True,  # defaults to False
)

# load checkpoints from a parallelized model
model.from_parallelized(path="./checkpoints")
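
As a rough illustration of what save_config=True could enable at load time, the sketch below stores the parallelism settings as JSON next to the checkpoint and compares them with the requested configuration. Everything here is an assumption made for illustration: the file name `parallel_config.json` and the helpers `save_parallel_config` and `needs_resharding` are hypothetical and are not part of the proposed API.

```python
import json
import tempfile
from pathlib import Path

CONFIG_FILE = "parallel_config.json"  # hypothetical file name


def save_parallel_config(save_directory: str, config: dict) -> Path:
    """Write the parallelism settings next to the checkpoint as JSON."""
    directory = Path(save_directory)
    directory.mkdir(parents=True, exist_ok=True)
    config_path = directory / CONFIG_FILE
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


def needs_resharding(save_directory: str, new_config: dict) -> bool:
    """Return True if the saved parallel layout differs from the requested one."""
    saved = json.loads((Path(save_directory) / CONFIG_FILE).read_text())
    keys = ("tensor_parallel_size", "pipeline_parallel_size")
    return any(saved[key] != new_config[key] for key in keys)


with tempfile.TemporaryDirectory() as tmp:
    # Layout used when the checkpoint was written.
    save_parallel_config(tmp, {"tensor_parallel_size": 2, "pipeline_parallel_size": 4})
    # Layout requested when training resumes.
    changed = needs_resharding(tmp, {"tensor_parallel_size": 4, "pipeline_parallel_size": 3})
    unchanged = needs_resharding(tmp, {"tensor_parallel_size": 2, "pipeline_parallel_size": 4})
```

Keeping the saved layout in a separate, human-readable file means a loader can decide whether repartitioning is needed before it reads any weights.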
