wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0
4.18k stars 1.08k forks

How to resume training from batch index #2565

Closed anjul1008 closed 2 months ago

anjul1008 commented 4 months ago

Hi, I saw the new step-based checkpoint saving feature recently. I noticed that training starts from batch index 0; how can I resume training from the same batch index where I stopped it?

I didn't find a direct way to do this, like the one the PyTorch dataloader provides. Can you offer some suggestions?
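(For readers landing here: one manual workaround, not part of WeNet and sketched here only under the assumption of a plain iterable dataloader, is to record the last batch index in the checkpoint and skip that many batches on resume. The batches skipped this way are still read and discarded, so the skip is not free.)

```python
import itertools

def resume_iter(dataloader, start_batch):
    """Skip the first `start_batch` batches so iteration resumes mid-epoch.

    `dataloader` is any iterable of batches; `start_batch` would come from
    a field saved in the checkpoint (hypothetical, not a WeNet feature).
    """
    return itertools.islice(iter(dataloader), start_batch, None)

# toy example: a "dataloader" of 10 batches, resuming at batch 6
batches = list(resume_iter(range(10), 6))
```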

thank you.

xingchensong commented 4 months ago

It's not supported yet. BTW, resuming from batch index 0 does not hurt performance in our experiments.

anjul1008 commented 4 months ago

I need to stop and resume multiple times in a week, and since I am training with significantly large data, I will not be able to complete even the first epoch. Frequently stopping and resuming may hurt performance, since the LR decreases with each iteration.

xingchensong commented 4 months ago

A quick solution is to shuffle data.list with a different seed before every resume.
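A minimal sketch of that suggestion, assuming data.list is one utterance entry per line and that you pick a fresh seed (e.g. a resume counter) before each restart; the function name is hypothetical, not a WeNet API:

```python
import random

def reshuffle_data_list(lines, seed):
    """Return data.list lines in a new random order derived from `seed`.

    A seeded Random instance makes each resume reproducible while a new
    seed per resume gives a different traversal order of the data.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# demo: the same seed always yields the same order
utts = ["utt1", "utt2", "utt3", "utt4"]
order_a = reshuffle_data_list(utts, seed=1)
order_b = reshuffle_data_list(utts, seed=1)
```

In practice you would write the shuffled lines back to data.list before relaunching training.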

anjul1008 commented 4 months ago

Yeah, but then we can't be sure that we iterated over the entire dataset in each epoch. I'm thinking of manipulating the epoch parameter and iterating over small subsets as an epoch (epoch = complete_data_epoch * #subsets).

But it's just a hack.
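A sketch of that hack (all helper names here are hypothetical, not WeNet APIs): split data.list into N fixed subsets and treat each subset as one short "epoch", so N consecutive epochs cover the full data exactly once.

```python
def split_into_subsets(lines, num_subsets):
    """Partition data.list lines round-robin into num_subsets pieces,
    each of which is treated as one short 'epoch'."""
    return [lines[i::num_subsets] for i in range(num_subsets)]

def subset_for_epoch(subsets, epoch):
    """Map a running epoch counter onto the subset to train on, so that
    epoch = complete_data_epoch * num_subsets + subset_index."""
    return subsets[epoch % len(subsets)]

# demo: 6 utterances, 3 subsets -> epochs 0, 1, 2 cover the full data once
utts = ["u0", "u1", "u2", "u3", "u4", "u5"]
subsets = split_into_subsets(utts, 3)
covered = sorted(x for e in range(3) for x in subset_for_epoch(subsets, e))
```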

xingchensong commented 4 months ago

Why do you need to stop several times in a week?

anjul1008 commented 4 months ago

Due to resource limitations.

I have observed that the training process terminates after running a few batches because one of the GPUs runs out of memory (OOM) while the other GPUs still have memory available. This indicates that the batching process is not utilizing the full potential of all GPUs efficiently.

I have tried all 3 batching strategies, but the behavior remains the same.

xingchensong commented 4 months ago
  1. Decrease batch size and increase accum_grad
  2. Use DeepSpeed/FSDP
  3. Turn on gradient checkpointing
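(Suggestion 1 trades memory for time: smaller micro-batches with gradients accumulated over `accum_grad` steps before each optimizer update keep the effective batch size unchanged. A framework-free sketch of the pattern, with all names hypothetical:)

```python
def train_with_accumulation(micro_batch_grads, accum_grad):
    """Accumulate per-micro-batch 'gradients' and apply an update only
    every accum_grad micro-batches (effective batch = micro * accum_grad).

    Each gradient is scaled by 1/accum_grad so the accumulated sum
    matches the mean gradient of one large batch.
    """
    updates = []
    grad_sum = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        grad_sum += grad / accum_grad
        if step % accum_grad == 0:
            updates.append(grad_sum)  # optimizer.step() would go here
            grad_sum = 0.0            # optimizer.zero_grad() equivalent
    return updates

# demo: 8 micro-batch gradients with accum_grad=4 -> 2 optimizer steps
updates = train_with_accumulation([1.0] * 8, accum_grad=4)
```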