I'm planning to do secondary pretraining of Llama-2-7b on my language with a node of 8x H100 80GB. Given that I have enough time and resources, do you recommend:

1. Training on both English and my language (I'm aiming for a 1:1 ratio) so that the model does not forget its English knowledge? (Roughly what I have in mind for the mixing is sketched below.)
2. Since extending the vocabulary (30k -> 55k) introduces many newly initialized weights, it makes sense to me to go for LoRA, because it reduces the number of trainable weights and training should be more stable. But given that I can afford full-weight training, do you recommend doing that instead? (See the embedding-resizing sketch below.)
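For context, this is roughly what I have in mind for the 1:1 mixing, as a minimal sketch using the HuggingFace `datasets` library (the corpus file names are placeholders, not actual files):

```python
# Sketch: interleave English and target-language corpora at roughly a 1:1 ratio.
# The data_files paths below are placeholders.
from datasets import load_dataset, interleave_datasets

en_ds = load_dataset("text", data_files="english_corpus.txt", split="train")
my_ds = load_dataset("text", data_files="my_language_corpus.txt", split="train")

# probabilities=[0.5, 0.5] samples each source with equal likelihood, which
# approximates a 1:1 mix at the document level; "all_exhausted" keeps sampling
# until both corpora have been fully seen (the smaller one gets oversampled).
mixed = interleave_datasets(
    [en_ds, my_ds],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```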
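And for the vocabulary extension, a minimal sketch of how I would resize the embeddings after merging the extended tokenizer (the tokenizer path is a placeholder):

```python
# Sketch: extend the tokenizer (30k -> ~55k) and resize the embedding matrices
# so the newly added rows exist and can be trained.
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer")  # placeholder path to the extended vocab

# Adds new rows to embed_tokens and lm_head; these rows are freshly initialized
# and need to be trained during the secondary pretraining.
model.resize_token_embeddings(len(tokenizer))
```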
We don't have supporting experimental results on this; you would need to test the effectiveness yourself or refer to other literature.
LoRA trains faster and requires less GPU memory than full-parameter training, so it allows rapid iteration to validate experimental settings. Switching to full-parameter training once the experimental plan is confirmed may yield better results.
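As a rough illustration only (the rank, alpha, and module lists are assumptions, not this repo's exact training script), a PEFT LoRA setup that also keeps the resized embedding and output layers fully trainable could look like this:

```python
# Rough sketch: LoRA adapters on the attention/MLP projections, with the resized
# embed_tokens and lm_head kept fully trainable so the new vocabulary rows can learn.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder; in practice, load the model and resize its embeddings first.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                      # example rank (an assumption)
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train the extended vocab fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```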
Type of Issue: Model training and fine-tuning
Base Model: Others
Operating System: Linux