It's not supported yet. BTW, resuming from batch index 0 does not hurt performance in our experiments.
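For reference, one crude manual workaround (not a feature of this toolkit; the names below are illustrative) would be to record how many batches were consumed before stopping and skip that many on resume:

```python
# Hypothetical workaround sketch: skip the batches already consumed before the
# last stop, so training resumes roughly where it left off. `start_batch` would
# have to be saved alongside the checkpoint; it is not tracked by the toolkit.
import itertools

def resume_iter(dataloader, start_batch: int):
    # islice advances the iterator without yielding the skipped batches,
    # but the data loading/collation work for them still happens.
    return itertools.islice(iter(dataloader), start_batch, None)

# Usage:
# for batch_idx, batch in enumerate(resume_iter(train_loader, start_batch), start=start_batch):
#     train_one_batch(batch)
```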
I need to stop and resume multiple times in a week, and since I am training on a significantly large dataset I won't be able to complete the first epoch. Frequent stopping and resuming may hurt performance, since the LR decreases with each iteration.
A quick solution is to shuffle data.list with a different seed before every resume.
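A minimal sketch of that reshuffle step (assuming a plain text data.list; the script name, path, and seed handling are not part of the toolkit):

```python
# Shuffle data.list in place with a new seed before each resume.
import random
import sys

def shuffle_data_list(path: str, seed: int) -> None:
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)

if __name__ == "__main__":
    # e.g. python shuffle_list.py data/train/data.list 42
    shuffle_data_list(sys.argv[1], int(sys.argv[2]))
```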
Yeah, but then we can't be sure that we iterate over the entire dataset in each epoch. I'm thinking of manipulating the epoch parameter and iterating over small subsets as an epoch (epochs = complete_data_epochs * #subsets).
But it's just a hack.
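A rough sketch of that subset-as-epoch hack (the file naming and split scheme here are assumptions, not anything the toolkit provides):

```python
# Split data.list into N fixed subsets; each short run then trains one "epoch"
# per subset, so total epochs = complete_data_epochs * num_subsets.
import random

def split_data_list(path: str, num_subsets: int, seed: int = 0) -> None:
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    for i in range(num_subsets):
        with open(f"{path}.part{i}", "w", encoding="utf-8") as f:
            f.writelines(lines[i::num_subsets])

# Usage: split once, then point each run at data.list.part{k},
# cycling k so every subset is seen before any subset repeats.
# split_data_list("data/train/data.list", num_subsets=4)
```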
Why do you need to stop several times in a week?
Due to resource limitations.
I have observed that the training process terminates after running a few batches because one of the GPUs runs out of memory (OOM) while the other GPUs still have memory available. This indicates that the batching process is not using all GPUs efficiently.
I have tried all 3 batching strategies, but the behavior remains the same.
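This kind of imbalance usually comes from a few very long utterances landing in one rank's batch. As a generic illustration (plain PyTorch, not this repo's batching code; all names are assumptions), one way to cap per-batch memory is to budget total frames per batch:

```python
# Generic sketch: a batch sampler that caps the total frames per batch, so a
# single batch of long utterances cannot blow up memory on one GPU.
from typing import Iterator, List, Sequence
from torch.utils.data import Sampler

class FrameBudgetBatchSampler(Sampler):
    def __init__(self, lengths: Sequence[int], max_frames_in_batch: int):
        self.lengths = lengths
        self.max_frames_in_batch = max_frames_in_batch

    def __iter__(self) -> Iterator[List[int]]:
        # Sort by length so similarly sized utterances share a batch.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batch, frames = [], 0
        for idx in order:
            if batch and frames + self.lengths[idx] > self.max_frames_in_batch:
                yield batch
                batch, frames = [], 0
            batch.append(idx)
            frames += self.lengths[idx]
        if batch:
            yield batch

# Usage: DataLoader(dataset, batch_sampler=FrameBudgetBatchSampler(lengths, 12000))
```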
Hi, I saw the new step-based checkpoint saving feature recently. I noticed that training restarts from batch index 0; how can I resume training from the same batch index where I stopped?
I didn't find a direct way to do it, as one would with the PyTorch DataLoader. Can you provide some suggestions?
Thank you.