Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Code and data for "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance"

Data Mixing Laws

We include the code to reproduce the experiments and figures used to discover data mixing laws in

Prediction Pipeline

To reproduce our full prediction pipeline, run:

cd pipeline
bash run.sh

Citation

@article{ye2024datamixinglaws,
  title={Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance},
  author={Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhou, Yunhua and Zhan, Jun and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2403.16952},
  year={2024}
}