xrsrke / pipegoose

Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)*
MIT License

WIP: Trainer #23

Open isamu-isozaki opened 10 months ago

isamu-isozaki commented 10 months ago

This PR is a WIP, but it lays out the conceptual idea for issue #18.

isamu-isozaki commented 10 months ago

One issue so far: for the dataloader, the default transformers Trainer handles it in the get_train_dataloader function, which handles parallelization via

return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))

so we might want to override it for now, but in the future have an API compatible with accelerate.
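For illustration, here is a rough, untested sketch of what an override without accelerate might look like. `dp_world_size` and `dp_rank` are placeholders for whatever pipegoose's parallel context actually exposes, not real pipegoose APIs:

```python
# Hypothetical sketch: build the train dataloader ourselves instead of calling
# self.accelerator.prepare(...), sharding batches only across the data-parallel
# ranks so tensor/pipeline-parallel ranks see the same batches.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def get_train_dataloader(train_dataset, batch_size, dp_world_size, dp_rank):
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=dp_world_size,  # size of the data-parallel group (assumed)
        rank=dp_rank,                # this process's data-parallel rank (assumed)
        shuffle=True,
    )
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
```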

xrsrke commented 10 months ago
> return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
>
> so we might want to override it for now, but in the future have an API compatible with accelerate.

@isamu-isozaki Thank you so much for the PR. This is also the reason we don't want to use Trainer from transformers: because we implement our own 3D parallelism, we don't want it to be wrapped by accelerate.

isamu-isozaki commented 10 months ago

@xrsrke Sounds good. Let me rewrite it as a minimal training example. In the future, though, it might be better to inherit from it, etc., so that we can pick up updated versions of the transformers Trainer without having to maintain compatibility with it ourselves.
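A minimal training loop, without transformers' Trainer, could look roughly like this (an untested sketch; it assumes `model` has already been parallelized by pipegoose and returns an HF-style output with a `.loss` attribute):

```python
# Minimal training-loop sketch: plain PyTorch, no transformers.Trainer, no accelerate.
import torch


def train(model, train_dataloader, num_epochs, lr, device):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss  # assumes the model computes its own loss, as HF models do
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```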

isamu-isozaki commented 10 months ago

Made it a minimal proof-of-concept version (not tested yet). If we want to expand further, the two main options are:

  1. Reimplement pretty much all of Trainer in pipegoose.
  2. Override some of the methods in Trainer to make it work

Overall, this might not be a quick PR whichever of the two we do, but if option 2 works, that is probably the best choice.
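To make option 2 concrete, a subclass might only need to replace the pieces that route through accelerate. This is an untested sketch; aside from `get_train_dataloader`, which is a real Trainer method, the attribute access is approximate:

```python
# Sketch of option 2: subclass transformers.Trainer and override only the
# dataloader construction so it is not wrapped by accelerate.
from torch.utils.data import DataLoader
from transformers import Trainer


class PipegooseTrainer(Trainer):
    def get_train_dataloader(self) -> DataLoader:
        # Return a plain DataLoader instead of self.accelerator.prepare(...),
        # so pipegoose keeps full control over how batches are sharded.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            collate_fn=self.data_collator,
            shuffle=True,
        )
```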

xrsrke commented 10 months ago

@isamu-isozaki, the PR looks great. I think we should prefer option 1, because one potential direction for the future is supporting the parallelization of any arbitrary transformer torch module, not just models from transformers. transformers is a hub where people push already-trained models, whereas our library is the one people would use to start training from scratch. I also recommend checking out the Lightning trainer [link]. They have excellent abstractions, such as separating CallbackHandler (the component that connects callbacks to the trainer), Callback, and Trainer.

Here are my learning notes on Lightning's trainer: https://projectfoundation.notion.site/Lightning-f027845e720d4f74aa876b045e58669b. They could be helpful for you.
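For illustration, the separation mentioned above might be sketched like this. These are not Lightning's actual classes, just the shape of the idea: Callback defines hooks, CallbackHandler fans events out, and the Trainer only talks to the handler:

```python
# Illustrative sketch of the Callback / CallbackHandler / Trainer separation.
class Callback:
    def on_train_start(self, trainer): ...
    def on_step_end(self, trainer, loss): ...


class CallbackHandler:
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def fire(self, event, trainer, **kwargs):
        # Dispatch an event to every registered callback.
        for cb in self.callbacks:
            getattr(cb, event)(trainer, **kwargs)


class TrainerSketch:
    def __init__(self, callbacks=None):
        self.callback_handler = CallbackHandler(callbacks or [])

    def fit(self, steps):
        self.callback_handler.fire("on_train_start", self)
        for step in range(steps):
            loss = 0.0  # placeholder for a real training step
            self.callback_handler.fire("on_step_end", self, loss=loss)
```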

xrsrke commented 10 months ago

I will assign the task to you! Thank you. Sometimes, we also hold discussions on our Discord. Do you have a Discord account? https://discord.gg/nSyGZB6Gpp

isamu-isozaki commented 10 months ago

Ah, sounds good. How many features do you want for the initial Trainer? For example, do you have some tests in mind for preliminary use? And haha, I think we are already friends on Discord. Let me send you a message.