wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

How to resume training from batch index #2565

Open anjul1008 opened 1 month ago

anjul1008 commented 1 month ago

Hi, I saw the new step-based checkpoint saving feature recently. I noticed training starts from batch index 0. How do I resume training from the same batch index where I stopped?

I didn't find a direct way to do this, as there is with the PyTorch DataLoader. Can you provide some suggestions?

Thank you.

xingchensong commented 1 month ago

It's not supported yet. BTW, resuming from batch index 0 does not hurt performance in our experiments.
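
(Not WeNet's API, just a rough sketch of what a DIY mid-epoch resume could look like with a plain PyTorch-style dataloader; note that fast-forwarding this way still reads and discards the skipped batches:)

```python
# Rough sketch of a DIY mid-epoch resume (not WeNet API): persist the
# number of batches already consumed, then fast-forward the iterator
# on restart.
import itertools

def resume_dataloader(dataloader, batches_done):
    """Return an iterator positioned after `batches_done` batches."""
    return itertools.islice(iter(dataloader), batches_done, None)

# Usage: store the consumed-batch count in the checkpoint, then
# for batch in resume_dataloader(train_loader, batches_done): ...
```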

anjul1008 commented 1 month ago

i need to stop and resume multiple times in a week, and training with significant large data I will not be able to complete first epoch. Frequently stoping and resuming may hurt the performance since LR goes low with each iteration.

xingchensong commented 1 month ago

A quick solution is shuffling data.list with a different seed before every resume, e.g. something like the sketch below.
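
A minimal sketch of that idea (the helper and its CLI are illustrative, not part of WeNet):

```python
# Reshuffle WeNet's data.list with a fresh seed so each restart from
# batch 0 sees the data in a new order. Paths are placeholders.
import random
import sys

def reshuffle(list_path, seed):
    with open(list_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(list_path, "w") as f:
        f.writelines(lines)

if __name__ == "__main__":
    # e.g. python reshuffle.py data/train/data.list 1234
    reshuffle(sys.argv[1], int(sys.argv[2]))
```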

anjul1008 commented 1 month ago

Yeah, but then we can't be sure that we iterated over the entire dataset in each epoch. I'm thinking of manipulating the epoch parameter and iterating over small subsets as an epoch (epochs = complete_data_epochs * #subsets), roughly as in the sketch below.

But it's just a hack.
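
An illustrative sketch of that hack (file names, the shard count, and the epoch arithmetic are assumptions, not a tested recipe):

```python
# Split data.list into N shards and train on one shard per "epoch",
# so a restart repeats at most one shard's worth of data.
def split_list(list_path, num_shards):
    with open(list_path) as f:
        lines = f.readlines()
    for i in range(num_shards):
        with open(f"{list_path}.shard{i}", "w") as f:
            f.writelines(lines[i::num_shards])

split_list("data/train/data.list", num_shards=8)
# Then point epoch e at data.list.shard{e % 8} and scale the epoch
# count accordingly (epochs = complete_data_epochs * 8).
```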

xingchensong commented 1 month ago

Why do you need to stop several times in a week?

anjul1008 commented 1 month ago

Due to resource limitations.

I have observed that the training process terminates after running a few batches because one of the GPUs runs out of memory (OOM) while the other GPUs still have memory available. This suggests that the batching process is not using all GPUs to their full potential.

I have tried all 3 batching strategies, but the behavior remains the same.
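
(For context, the idea behind the "dynamic" strategy is to cap the total frames per batch so that peak memory stays bounded even when one rank draws unusually long utterances. A minimal sketch of that logic follows; WeNet configures this through batch_conf, and the exact key names vary by version:)

```python
# Group utterances so no batch exceeds a frame budget; one long
# utterance can no longer blow up a single rank's memory.
def dynamic_batches(utts, max_frames_in_batch=12000):
    """utts: iterable of (key, num_frames) pairs, ideally length-sorted."""
    batch, frames = [], 0
    for key, num_frames in utts:
        if batch and frames + num_frames > max_frames_in_batch:
            yield batch
            batch, frames = [], 0
        batch.append(key)
        frames += num_frames
    if batch:
        yield batch
```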

xingchensong commented 1 month ago

1. decrease the batch size and increase accum_grad (see the sketch after this list)
2. use DeepSpeed/FSDP
3. turn on gradient checkpointing
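
A PyTorch sketch of suggestion 1, showing why a smaller per-step batch plus gradient accumulation lowers peak memory while keeping the effective batch size (WeNet drives this via the accum_grad config key; the loop below only illustrates the mechanics):

```python
def train_epoch(model, optimizer, dataloader, accum_grad=4):
    # Effective batch size = per-step batch size * accum_grad,
    # but only one small batch is resident on the GPU at a time.
    optimizer.zero_grad()
    for i, (feats, labels) in enumerate(dataloader):
        loss = model(feats, labels)      # forward on the small batch
        (loss / accum_grad).backward()   # scale so gradients average
        if (i + 1) % accum_grad == 0:
            optimizer.step()             # one update per accum_grad batches
            optimizer.zero_grad()
```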