whyNLP / LCKV

Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance. Accepted to ACL 2024.
https://arxiv.org/abs/2405.10637

What specific version of Python is this project running on? #10

Closed. ChenHong30 closed this issue 1 week ago.

ChenHong30 commented 2 weeks ago

Hi, thank you for your awesome work. I am trying to rebuild your project on my local machine, but I ran into some trouble setting up the environment, especially when installing the following dependencies:

pip install deepspeed
pip install git+https://github.com/Dao-AILab/flash-attention@v2.3.6#subdirectory=csrc/layer_norm

Unfortunately, I did not save the error messages at the time, so I cannot show them to you here. However, the problem may be caused by the Python version and the platform.

I have tried installing with both Python 3.8 and Python 3.11, but the problem is the same.

In addition, the project on my local machine was deployed on Windows, which may be a factor. I will try Ubuntu later to see whether the problem goes away. But first, I would like to know the Python version and any other special settings that are not mentioned in the README.

Thank you very much for your kind help.

why-in-Shanghaitech commented 2 weeks ago

Hi! Thank you for the question. It seems that you are working on the dev-lckv-publish branch. I used Python 3.9.18 when developing this code, and Python 3.10.6 for the large-scale experiments on A800 GPUs. All under Ubuntu.

I guess it is hard to install the layer norm dependency on Windows since it requires compiling CUDA code (not sure). Actually, these dependencies are not necessary. Just ignore them, launch experiments with accelerate launch, and do not enable the environment variable LCKV_FUSED_RMSNORM. Training will be a little slower, but the difference is small.
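
Concretely, a minimal launch sketch (with the fused kernel left disabled) could look like the following; the script name run_clm.py is taken from the commands later in this thread, and the elided arguments would be your actual training arguments:

unset LCKV_FUSED_RMSNORM      # make sure the fused RMSNorm kernel is not enabled
accelerate launch run_clm.py ...   # fill in the training arguments here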

Another thing: the code in dev-lckv-publish has been deprecated and we have switched to the main branch. I believe the code in the main branch has better compatibility; it no longer depends on CUDA code, and instead we use Triton kernels.

ChenHong30 commented 2 weeks ago

Thank you so much for your fast reply. I have switched branches and applied your settings, and it works for me!

I have also noticed your latest paper, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference. In the appendix, Figure 2 describes the performance of a TinyLlama that only uses the framework you proposed at the inference stage. That is awesome work.

I want to try applying more KV layers, reproduce the performance, and see how the curve behaves. Unfortunately, I am a little confused about the execution order you posted on GitHub; that might be due to my limited coding skills :( Could you tell me what I should do to make this work?

Thanks for your help~

why-in-Shanghaitech commented 2 weeks ago

Thank you for the reply!

I have also noticed your latest paper, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference. In the appendix, Figure 2 describes the performance of a TinyLlama that only uses the framework you proposed at the inference stage. That is awesome work.

There might be some misunderstanding here: the experiment you mentioned requires training on the Minipile dataset. It uses the pretrained TinyLlama weights to initialize the model so that training is much faster. The method is similar to MLKV.

I want to try applying more KV layers, reproduce the performance, and see how the curve behaves. Unfortunately, I am a little confused about the execution order you posted on GitHub; that might be due to my limited coding skills :( Could you tell me what I should do to make this work?

Sure, just follow the steps below. Sorry that the README only briefly introduces the usage; to learn more, you can use the --help command-line argument.
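
For example (assuming both scripts use a standard argument parser, so --help prints their full argument lists):

python convert_pretrained.py --help
python run_clm.py --help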

Take TinyLlama as an example. With the environment prepared, run the following commands:

python convert_pretrained.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --config_name configs/tinyllama_lckv.json \
    --config_overrides layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 \
    --output_dir outputs/tinyllama-converted

accelerate launch run_clm.py \
    --model_name_or_path outputs/tinyllama-converted \
    --dataset_name JeanKaddour/minipile \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --auto_find_batch_size \
    --gradient_accumulation_steps 1 \
    --block_size 2048 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.015 \
    --learning_rate 3e-4 \
    --weight_decay 1e-1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --do_train \
    --do_eval \
    --use_liger_kernel \
    --num_train_epochs 3 \
    --save_total_limit 1 \
    --save_strategy steps \
    --save_steps 500 \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --load_best_model_at_end True \
    --metric_for_best_model eval_loss \
    --report_to none \
    --run_name tinyllama-test \
    --overwrite_output_dir \
    --output_dir outputs/tinyllama-test

Feel free to change --config_overrides layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 as you like.
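
If I read the layer_types notation correctly (this is my own reading, not an official description), each underscore-separated entry i is the index of the layer whose KV cache layer i attends to, so a layer that points to itself keeps its own KV cache. Under that assumption, a variant with more KV layers could look like the sketch below, which keeps KV caches at layers 0, 10, 20 and 21, with layers 1-10 sharing layer 10's KV and layers 11-20 sharing layer 20's KV (the output directory name is just a placeholder):

python convert_pretrained.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --config_name configs/tinyllama_lckv.json \
    --config_overrides layer_types=0_10_10_10_10_10_10_10_10_10_10_20_20_20_20_20_20_20_20_20_20_21 \
    --output_dir outputs/tinyllama-converted-4kv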

ChenHong30 commented 2 weeks ago

Thanks, I understand how the process works now. That is, there is no way to directly evaluate a pre-trained model with LCKV applied without re-training; in other words, the following does not work:

# The same as you posted
python convert_pretrained.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --config_name configs/tinyllama_lckv.json \
    --config_overrides layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 \
    --output_dir outputs/tinyllama-converted

# same command as above, but with --do_train commented out
accelerate launch run_clm.py \
    --model_name_or_path outputs/tinyllama-converted \
    ...
    --torch_dtype bfloat16 \
 #   --do_train \
    ...
    --output_dir outputs/tinyllama-test

The code above is wrong: it led to terrible results when I tried it.

The correct process should be:

  1. Download the pre-trained model from Hugging Face
  2. Use the python convert_pretrained.py ... command to convert the model to the LCKV format
  3. Use the full accelerate launch run_clm.py ... command to re-train and evaluate the model
  4. For further evaluation, run python test_harness.py --model_name_or_path ... and python test_latency.py with the corresponding re-trained model.

If I have misunderstood anything, please correct me. Thanks again for your support; I feel I'm getting closer to replicating your experiment ^-^.

why-in-Shanghaitech commented 2 weeks ago

Thanks for the quick reply.

That is, there is no way to directly evaluate a pre-trained model with LCKV applied without re-training

I have not tried to do this... but such a method may exist. You might be interested in this work: KVSharer.

The correct process should be:

  1. Download the pre-trained model from Hugging Face
  2. Use the python convert_pretrained.py ... command to convert the model to the LCKV format
  3. Use the full accelerate launch run_clm.py ... command to re-train and evaluate the model
  4. For further evaluation, run python test_harness.py --model_name_or_path ... and python test_latency.py with the corresponding re-trained model.

Yes. The download happens automatically when you run those commands.
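
For step 4, a minimal invocation might look like the sketch below, using the output directory from the training command above. Passing --model_name_or_path to test_latency.py is my assumption, so check each script's --help for its actual arguments:

python test_harness.py --model_name_or_path outputs/tinyllama-test
python test_latency.py --model_name_or_path outputs/tinyllama-test   # flag assumed; see --help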

ChenHong30 commented 2 weeks ago

Great, I understand the process now with your help, but the training and testing time is much longer than I expected: it seems it would take over 50 hours to run the commands above (2 * RTX 3090 24GB). I wonder whether this is normal or whether I have misconfigured something. May I ask how long it takes you to train the 1B model (using exactly the commands you provided above)?

why-in-Shanghaitech commented 1 week ago

Thank you for the reply. I am surprised that you only need 50 hours of training with 2 RTX 3090s... This command trains a 1.1B model on about 5.1B tokens (Minipile is about 1.7B tokens and the command runs 3 epochs), and on my side the estimated training time with the same cards is over 300 hours. A standard Llama model would take over 100 hours.

ChenHong30 commented 1 week ago

Alright, I did not wait for the whole process; the 50 hours was just the estimate given by the program, so I think you are right... But what about the 110M version, did you use this one? How long did it take, and on which GPUs? For some reason I have to download models manually, so I want to make sure I am not doing the wrong thing. Thanks.

why-in-Shanghaitech commented 1 week ago

Erm... no, we did not use any pre-trained model for the 110M size; we just train from scratch. The configuration is at https://github.com/whyNLP/LCKV/blob/main/configs/llama_small_lckv.json (slightly different from the link you provided).

The estimated time to train on WikiText-103 for 1 epoch (140M tokens) with 2 RTX 3090s is about 50 minutes on my side. For more details about the 110M experiments, I think Wu You is the best person to answer these questions. cc @pigdogbaby

pigdogbaby commented 1 week ago

Hi! Thank you for your attention. Here is the total training time of the 110M models on Minipile (1.7B tokens) w/ 2 epochs w/ 1 RTX 3090. The training time of our nine configurations is almost constant with the number of KV layers, so we report that of the models with 6 KV layers.

Configuration      Training Time (s)
Llama              56752.5534
Pizza-Bottom       53322.6042
Sandwich-Bottom    55223.4724
Lasagna-Bottom     55140.7039
Pizza-Top          124809.1968
Sandwich-Top       122462.5608
Lasagna-Top        174526.2895
Pizza-Middle       96269.3851
Sandwich-Middle    95579.8553
Lasagna-Middle     166053.6448
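
For scale, 56752.5534 s is roughly 15.8 hours, while the slowest configuration here (Lasagna-Top, 174526.2895 s) is roughly 48.5 hours.
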
ChenHong30 commented 1 week ago

I understand, thanks.

So the 110M and 1.1B models are based on Llama2-7b-hf, and trained from scratch? That is, I need to run:

python convert_pretrained.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --config_name configs/llama_small_lckv.json \
    --output_dir outputs/llama-converted-110M-untrained

Then for training:

accelerate launch run_clm.py \
    --model_name_or_path outputs/llama-converted-110M-untrained \
    --dataset_name JeanKaddour/minipile \
    ...
    --run_name llama-110M \
    --overwrite_output_dir \
    --output_dir outputs/llama-110M

Because if I use TinyLlama 1.1B as the base model to convert, an error occurs during conversion.

RuntimeError: The size of tensor a (768) must match the size of tensor b (2048) at non-singleton dimension 1

The 768 and 2048 here look like hidden sizes.

pigdogbaby commented 1 week ago

Thank you for the reply.

So the 110M and 1.1B models are based on Llama2-7b-hf?

Neither the 110M model nor the 1.1B model is based on Llama2-7b-hf. Maybe "_name_or_path": "meta-llama/Llama-2-7b-hf" at https://github.com/whyNLP/LCKV/blob/main/configs/llama_small_lckv.json is a little confusing: it is just a placeholder that will always be overwritten. If you want to train a 110M model from scratch, there is no need to run python convert_pretrained.py .... You only need to run:

accelerate launch run_clm.py \
    --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T \
    --config_name configs/llama_small_lckv.json \
    --dataset_name JeanKaddour/minipile \
    ...
    --run_name llama-110M \
    --overwrite_output_dir \
    --output_dir outputs/llama-110M

which is similar to https://github.com/whyNLP/LCKV/blob/main/run_clm.sh.

The 768 and 2048 here look like hidden sizes.

You are right. The error occurs because of the mismatch between the hidden sizes of the 110M and 1.1B models. TinyLlama-1.1B is not an appropriate base model for initializing our 110M model. We have not found any pre-trained model that can be used to initialize our 110M model.
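
As a quick sanity check of the two hidden sizes, something like the following should work (assuming both config files expose a standard "hidden_size" field, as Llama-style configs do; the expected values come from the error message above):

python -c "import json; print(json.load(open('configs/llama_small_lckv.json'))['hidden_size'])"   # expected: 768
python -c "import json; print(json.load(open('configs/tinyllama_lckv.json'))['hidden_size'])"     # expected: 2048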

ChenHong30 commented 1 week ago

I didn't expect you to reply so quickly. Thanks a lot.

After reading run_clm.sh, I found the correct way to train a model from scratch, just as you mentioned.

However, can I use the same command to train the 1.1B model by just changing --config_name configs/llama_small_lckv.json to --config_name configs/tinyllama_lckv.json?

Or should I also change --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T to --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T?

pigdogbaby commented 1 week ago

It's my pleasure.

If you want to train a 1.1B model from scratch, changing --config_name configs/llama_small_lckv.json to --config_name configs/tinyllama_lckv.json is enough.

There is no need to modify --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T to --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T. The two tokenizers are actually identical, and you can use either one.
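
Putting the two points together, a from-scratch 1.1B run would then look like the sketch below; the run name and output directory are placeholders of my own, and the "..." still stands for the remaining run_clm.py arguments from the full command earlier in this thread:

accelerate launch run_clm.py \
    --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T \
    --config_name configs/tinyllama_lckv.json \
    --dataset_name JeanKaddour/minipile \
    ...
    --run_name tinyllama-from-scratch \
    --overwrite_output_dir \
    --output_dir outputs/tinyllama-from-scratch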