Open mohammadaminabbasi opened 1 year ago
As described in this issue, the configuration would be similar to pretraining on TPU pods, with the addition of the JAX distributed initialization settings. However, you'll have to tune the mesh shape and batch size yourself according to the configuration of your own cluster in order to obtain the best throughput. Unfortunately I don't have access to a few hundred A100s, so I cannot provide a good example for that.
I would like to request information regarding the pretraining configuration of LLaMA on A100 80G GPUs for my project. As I am planning to use this setup for my research, having access to the specific pretraining configuration details would greatly help me replicate and benchmark the results and achieve the best training speed.
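For reference, the distributed setup and mesh tuning mentioned above can be sketched as below. This is a minimal illustration, not a tested multi-node recipe: the coordinator address, process counts, and mesh split are hypothetical placeholders you would replace with your own cluster's values.

```python
import numpy as np
import jax
from jax.sharding import Mesh

# On a multi-node GPU cluster, every process would first call something like
# (all values here are hypothetical placeholders for your own cluster):
#   jax.distributed.initialize(
#       coordinator_address="10.0.0.1:1234",  # node 0's address
#       num_processes=4,                       # e.g. 4 nodes
#       process_id=rank,                       # this node's rank, 0..3
#   )

# Build a 2D device mesh. A common heuristic is to put model (tensor)
# parallelism within a node's NVLink domain and data parallelism across
# nodes, e.g. shape (4, 8) for 4 nodes x 8 A100s. Here we derive a
# trivial mesh from whatever devices are visible so the sketch also runs
# on a single host.
devices = np.array(jax.devices())
mp = 1                      # model-parallel axis size (tune per cluster)
dp = devices.size // mp     # data-parallel axis size
mesh = Mesh(devices.reshape(dp, mp), axis_names=("dp", "mp"))
print(mesh.shape)
```

The throughput-relevant knobs are the mesh split (`dp` vs `mp`) and the per-device batch size; both have to be searched empirically for a given interconnect and model size, which is why no single A100 config transfers directly from the TPU-pod settings.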