zhiyuanhubj / LongRecipe

https://arxiv.org/abs/2409.00509
Topics: large-language-models, llm, long-context-modeling, pretraining

# LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

🤗 LongRecipe-Llama3-8B-128k • 🤗 LongRecipe-Qwen2-7B-128k • 📃 Paper

## Project Directory Structure

```
LongRecipe/
├── accelerate_configs/
│   ├── config_files
├── utils/
│   └── preprocess_token_PI/
│       ├── dataprocessor.py
│       └── FSProcessor.py
│   └── easy_context/
│       ├── dist_flash_attn/
│       ├── ulysses_attn/
│       └── zigzag_ring_attn/
│       ├── loader.py
│       ├── logger.py
│       └── preprocess_data.py
├── README.md
├── train_LR_llama3_target80k_use24k.sh
├── requirements.txt
└── train.py
```

## Reproduction

Before starting data preprocessing and model training, make sure all necessary dependencies are installed:

`pip install -r requirements.txt`

### Data Preprocessing (Example: Llama3)

First, download the dataset tokenized with the Llama3 tokenizer from this link. After downloading, run the following command to generate the position index files for the different training approaches:

```
# Load the dataset and generate the position index files
python preprocess_token_PI/dataprocessor.py
```

### Model Training

Training is divided into three stages that extend the LLM's context window while preserving its original capabilities.

#### Stage 1: Context Window Extension

In the first stage, we extend the context window using a dataset containing 1.7B tokens. The following command initiates this training stage:

```
accelerate launch \
--config_file accelerate_configs/single_node.yaml \
train.py \
--batch-size 1 \
--gradient-accumulate-every 96 \
--learning-rate 5e-5 \
--epoch 1 \
--data_path $DATA_PATH_CONTEXT_EXTENSION \
--output-dir ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL \
--seed 2027 \
--model $MODEL \
--seq-length $SEQ_LENGTH \
--target-length $TARGET_LENGTH \
--log-path $SETTING-$SEQ_LENGTH-$MODEL_NAME-$SUB_LABEL.log \
--setting $SETTING \
--right_points-path $Right_Points_PATH \
--fs_PI-path $FS_PI_PATH \
--parallel_mode ulysses_attn \
--num_proc 5 \
--stage 0
```

Arguments Explanation:

* **--data_path**: Path to the dataset of Llama3-tokenized samples.
* **--model**: The base model used for training.
* **--seq-length**: The sequence length used during training.
* **--target-length**: The target context window length.
* **--setting**: The training method; one of FLT, RPES, PoSE, or LongRecipe.
* **--right_points-path**: Path to the PoSE right-point set file.
* **--fs_PI-path**: Path to LongRecipe's position index file.

After training, copy the tokenizer files to the output directory and remove the unnecessary files:

```
cp $MODEL/special_tokens_map.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_0
cp $MODEL/tokenizer_config.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_0
cp $MODEL/tokenizer.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_0
rm ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_0/model.safetensors
```

#### Stage 2: Training Annealing

In the second stage, we perform training annealing on a mix of general and domain-specific data, gradually decaying the learning rate to zero. Approximately 100M tokens are used in this phase.
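As a point of reference, the sketch below shows one way to implement a learning-rate schedule that decays linearly to zero in PyTorch. The tiny model, optimizer, and step budget are placeholders for illustration only; the schedule actually used during annealing is the one implemented in `train.py`.

```python
# Minimal sketch of linear decay-to-zero annealing (illustration only).
# The dummy model, data, and step budget are placeholders, not the
# repository's training loop; see train.py for the real implementation.
import torch

model = torch.nn.Linear(16, 16)                       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

total_steps = 100                                     # placeholder step budget
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps),
)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()    # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # lr shrinks linearly toward zero
    optimizer.zero_grad()
```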
The following command launches this stage:

```
accelerate launch \
--config_file accelerate_configs/single_node_2.yaml \
train.py \
--data_path $DATA_PATH_ANNEALING \
--batch-size 1 \
--gradient-accumulate-every 96 \
--learning-rate 5e-6 \
--epoch 1 \
--output-dir ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL \
--seed 2027 \
--model $STAGE_1_MODEL \
--seq-length $SEQ_LENGTH \
--target-length $TARGET_LENGTH \
--log-path $SETTING-$SEQ_LENGTH-$MODEL_NAME-$SUB_LABEL.log \
--setting $SETTING \
--right_points-path $Right_Points_PATH \
--fs_PI-path $FS_PI_PATH \
--parallel_mode ulysses_attn \
--num_proc 10 \
--stage 1
```

Then copy the updated tokenizer files to the output directory:

```
cp $MODEL/special_tokens_map.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_1
cp $MODEL/tokenizer_config.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_1
cp $MODEL/tokenizer.json ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_1
rm ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL/stage_1/model.safetensors
```

In our experiments, we merge the two datasets mentioned in our paper and format each sample as follows:

```
{
    "prompt": ,
    "response":
}
```

#### Stage 3: Model Merge

The final stage merges the original model with the fine-tuned model using an average-weight strategy to enhance the model's foundational capabilities.

```
accelerate launch \
--config_file accelerate_configs/single_node.yaml \
train.py \
--output-dir ./output/$MODEL_NAME-$SETTING-$SEQ_LENGTH-$SUB_LABEL \
--seed 2027 \
--model $MODEL \
--log-path $SETTING-$SEQ_LENGTH-$MODEL_NAME-$SUB_LABEL.log \
--stage 2
```

Alternatively, after preprocessing your data, you can run all three stages with a single command:

```
bash ./train_scirpts/train_LR_llama3_target80k_use24k.sh
```

## Citation

If you find this repo helpful, please cite our paper as follows:

```
@article{hu2024longrecipe,
  title={LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models},
  author={Zhiyuan Hu and Yuliang Liu and Jinman Zhao and Suyuchen Wang and Yan Wang and Wei Shen and Qing Gu and Anh Tuan Luu and See-Kiong Ng and Zhiwei Jiang and Bryan Hooi},
  journal={arXiv preprint arXiv:2409.00509},
  year={2024}
}
```