yule-BUAA / MergeLM

Codebase for Merging Language Models (ICML 2024)

Llama model support #5

Closed: paulcx closed this issue 11 months ago

paulcx commented 11 months ago

Can the merge_llms_instruct_math_code script be applied to merging LLMs other than WizardMath? For example, how would I merge two llama models?

yule-BUAA commented 11 months ago

Yes, you can merge any LLMs with the merge_llms_instruct_math_code script as long as they are fine-tuned from the same pre-trained backbone. For example, in our merge_llms_instruct_math_code script, apart from WizardMath, you can also merge WizardLM and Code Alpaca by setting merge_instruct and merge_code to True. You can even merge WizardLM, WizardMath, and Code Alpaca by setting merge_instruct, merge_math, and merge_code to True.

If you want to merge two llama models, you should 1) download the two llama model files as well as their pre-trained model files; and 2) replace the original model information (e.g., model names, file paths) in the merge_llms_instruct_math_code script with the information for your llama models. I think this will work for your case.
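For illustration only, here is a minimal sketch of step 1) using Hugging Face transformers. The paths and variable names are placeholders; how model information is actually organized inside merge_llms_instruct_math_code is defined by the repository and not shown here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: the shared pre-trained backbone (pt model) and the two
# llama models fine-tuned from it (sft models).
pretrained_path = "/path/to/llama-7b"
sft_path_a = "/path/to/your-llama-sft-a"
sft_path_b = "/path/to/your-llama-sft-b"

pt_model = AutoModelForCausalLM.from_pretrained(pretrained_path)
sft_model_a = AutoModelForCausalLM.from_pretrained(sft_path_a)
sft_model_b = AutoModelForCausalLM.from_pretrained(sft_path_b)
tokenizer = AutoTokenizer.from_pretrained(pretrained_path)
```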

paulcx commented 11 months ago

A few things to confirm:

  1. Merging two llama models requires three models (two sft models and one pt model), right?
  2. Does merge_llms_instruct_math_code work for AWQ models?

yule-BUAA commented 11 months ago

  1. Yes. In most cases, three models (two sft models and one pt model) are required. However, if you just want to merge two llama models with average merging (i.e., set merging_method_name to average_merging), you only need the two sft models, since this method simply averages their weights and does not use the pt model (see the sketch after this list). Otherwise, you need three models.
  2. Currently, we have not investigated merging AWQ models and are not sure whether the merging methods would work in such a case. At the very least, I think the merge_llms_instruct_math_code script can be executed to merge AWQ models if their weights have the same bit-precision, but the performance may be unpredictable.
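Regarding point 1, here is a minimal sketch of what average merging does; it is not the repository's implementation, just an illustration of the parameter-wise mean, which is why the pt model is not needed.

```python
# Illustrative sketch of average merging: parameter-wise mean of two fine-tuned
# models that share the same pre-trained backbone; the pt model is not used.
def average_merge(state_dict_a, state_dict_b):
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        merged[name] = (param_a + param_b) / 2.0
    return merged

# Usage (assuming sft_model_a and sft_model_b were loaded as in the earlier sketch):
# merged_state_dict = average_merge(sft_model_a.state_dict(), sft_model_b.state_dict())
# sft_model_a.load_state_dict(merged_state_dict)
```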
paulcx commented 11 months ago

What is the difference between merging two models with and without the pt model? What is the pt model used for in merging?

yule-BUAA commented 11 months ago

Some model merging methods (e.g., Task Arithmetic and TIES-Merging) use the difference between an sft model and the pt model as the task vector and then operate on these task vectors, while other methods (e.g., Average Merging) do not. You can refer to the references for more details.
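For intuition, here is a minimal sketch of the task-vector idea behind Task Arithmetic. It is an illustration rather than the repository's implementation; TIES-Merging additionally trims the task vectors and resolves sign conflicts before combining them.

```python
# Illustrative sketch of Task-Arithmetic-style merging: each task vector is the
# parameter-wise difference between an sft model and the shared pt model, and
# the merged model adds the scaled sum of task vectors back onto the pt model.
def task_arithmetic_merge(pt_state_dict, sft_state_dicts, scaling_coefficient=1.0):
    merged = {}
    for name, pt_param in pt_state_dict.items():
        task_vectors = [sft_sd[name] - pt_param for sft_sd in sft_state_dicts]
        merged[name] = pt_param + scaling_coefficient * sum(task_vectors)
    return merged

# Usage (assuming the models were loaded as in the earlier sketch):
# merged_state_dict = task_arithmetic_merge(
#     pt_model.state_dict(),
#     [sft_model_a.state_dict(), sft_model_b.state_dict()],
#     scaling_coefficient=1.0,
# )
```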

yule-BUAA commented 11 months ago

Closing this issue now.

Please feel free to reopen it when there are any further questions.