yegcjs / mixinglaws

Missing files in the repo #5

Open SivilTaram opened 1 month ago

SivilTaram commented 1 month ago

First of all, thank you for your excellent work on this project! I would like to use find_opt.py to determine the optimal data mixture ratio in other settings, such as a larger model size or a larger step count. As far as I can tell, this relies on get_loss.py to obtain the loss for the larger models.

However, both scripts depend on the STEPLAW_FILES and SIZELAW_FILES variables, and the files they refer to are missing from the repository. Could you please upload them?

Additionally, if you could provide the relevant files for the Pile dataset, that would be greatly appreciated.

yegcjs commented 1 week ago

Thanks for your interest in our work!

The variables you mention are defined in utils.py, and the files they refer to are generated by steplaw.py and sizelaw.py. See run.sh for an example of the full pipeline.
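
Before running find_opt.py or get_loss.py, a quick check like the sketch below may help confirm that the generated files are actually on disk. It assumes that STEPLAW_FILES and SIZELAW_FILES can be treated as flat lists of file paths, which this thread does not confirm. The `missing_files` helper and the example paths are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist on disk, preserving input order."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Hypothetical placeholder paths; substitute the real values from utils.py
    # (assumed here to be plain lists of file paths).
    example_paths = ["fits/steplaw_example.json", "fits/sizelaw_example.json"]
    for path in missing_files(example_paths):
        print(f"missing: {path} (run steplaw.py / sizelaw.py first)")
```

If the variables turn out to be dictionaries or nested structures, the paths would need to be flattened into a list before calling the helper.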

For the Pile, we obtain the validation set from https://huggingface.co/datasets/mit-han-lab/pile-val-backup.