Closed by yngtodd 6 years ago
Adds fault-tolerant checkpointing! Now we can resume previous distributed runs, even if some of the ranks failed. 🔥