xrsrke / pipegoose

Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*
MIT License

[Fix] Name generalization of transformer blocks #44

Closed abourramouss closed 9 months ago

abourramouss commented 9 months ago

As @danielgrittner pointed out, some models use different naming conventions while following the same structural pattern. This PR fixes the issue by using `base_model_prefix` to find the transformer block name, instead of always assuming `transformer_h_X`.

Tests pass unchanged: this doesn't affect the partitioning itself, it only generalizes how transformer blocks are identified.
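The approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `DummyModel` is a stand-in for a HuggingFace model (its `base_model_prefix` attribute and `transformer.h` layout mimic the GPT-2 convention), and `find_block_path` is a hypothetical helper name.

```python
# Hedged sketch: locate the transformer block list via base_model_prefix
# instead of hardcoding a name like "transformer_h_X".
# DummyModel stands in for a HuggingFace model; in the real transformers
# library, models expose base_model_prefix (e.g. "transformer" for GPT-2,
# "model" for some others) and the blocks live in a ModuleList.

class DummyBlock:
    """Placeholder for a single transformer block."""


class DummyDecoder:
    """Placeholder for the decoder stack holding the block list."""

    def __init__(self, n_blocks):
        self.h = [DummyBlock() for _ in range(n_blocks)]

    def named_children(self):
        # Mimics nn.Module.named_children(): yields (name, child) pairs.
        return [("h", self.h)]


class DummyModel:
    base_model_prefix = "transformer"  # GPT-2-style prefix (assumption)

    def __init__(self):
        self.transformer = DummyDecoder(4)


def find_block_path(model):
    """Return the dotted path to the list of transformer blocks.

    Follows model.base_model_prefix to the decoder stack, then scans its
    children for a non-empty sequence of blocks, so the result holds for
    models that don't name the stack "transformer.h".
    """
    stack = getattr(model, model.base_model_prefix)
    for name, child in stack.named_children():
        if isinstance(child, (list, tuple)) and len(child) > 0:
            return f"{model.base_model_prefix}.{name}"
    raise ValueError("no transformer block list found")


print(find_block_path(DummyModel()))  # -> transformer.h
```

In the real library the children are `nn.ModuleList` instances rather than plain lists, but the lookup strategy is the same: resolve the stack through `base_model_prefix` and discover the block container by inspection rather than by a hardcoded name.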