I'm planning to do secondary pretraining of Llama-2-7b on my language with a node of 8x H100 80GB. Given that I have enough time and resources, do you recommend:

1. Training on both English and my language (I'm aiming for a 1:1 ratio) so that the model does not forget its English knowledge? (Roughly what I have in mind for the mixing is sketched below.)
2. Since extending the vocabulary (30k -> 55k) introduces many newly initialized weights, it makes sense to me to go for LoRA, because it reduces the number of trainable weights and training should be more stable. But given that I can afford full-weight training, do you recommend doing that instead? (See the embedding-resizing sketch below.)
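For context, this is roughly what I have in mind for the 1:1 mixing, as a minimal sketch using the HuggingFace `datasets` library (the corpus file names are placeholders, not actual files):

```python
# Sketch: interleave English and target-language corpora at roughly a 1:1 ratio.
# The data_files paths below are placeholders.
from datasets import load_dataset, interleave_datasets

en_ds = load_dataset("text", data_files="english_corpus.txt", split="train")
my_ds = load_dataset("text", data_files="my_language_corpus.txt", split="train")

# probabilities=[0.5, 0.5] samples each source with equal likelihood, which
# approximates a 1:1 mix at the document level; "all_exhausted" keeps sampling
# until both corpora have been fully seen (the smaller one gets oversampled).
mixed = interleave_datasets(
    [en_ds, my_ds],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```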
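And for the vocabulary extension, a minimal sketch of how I would resize the embeddings after merging the extended tokenizer (the tokenizer path is a placeholder):

```python
# Sketch: extend the tokenizer (30k -> ~55k) and resize the embedding matrices
# so the newly added rows exist and can be trained.
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer")  # placeholder path to the extended vocab

# Adds new rows to embed_tokens and lm_head; these rows are freshly initialized
# and need to be trained during the secondary pretraining.
model.resize_token_embeddings(len(tokenizer))
```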
We don't have supporting experimental results on this; you would need to test the effectiveness yourself or refer to other literature.
LoRA trains faster and requires less GPU memory than full-parameter training, so it allows rapid iteration to validate experimental settings. Switching to full-parameter training once the experimental plan is confirmed may yield better results.
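As a rough illustration only (the rank, alpha, and module lists are assumptions, not this repo's exact training script), a PEFT LoRA setup that also keeps the resized embedding and output layers fully trainable could look like this:

```python
# Rough sketch: LoRA adapters on the attention/MLP projections, with the resized
# embed_tokens and lm_head kept fully trainable so the new vocabulary rows can learn.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder; in practice, load the model and resize its embeddings first.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                      # example rank (an assumption)
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train the extended vocab fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```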
Type of Issue: Model training and fine-tuning
Base Model: Others
Operating System: Linux