yuhuixu1993 / qa-lora

Official PyTorch implementation of QA-LoRA
MIT License

cannot reproduce the MMLU accuracy claimed in paper, could you release the script? #28

Open wenjingk-xilinx opened 7 months ago

wenjingk-xilinx commented 7 months ago

Hi, I am trying to reproduce the LLaMA-7B fine-tuning results on MMLU. The best 5-shot eval accuracy I get is 36.5% with 4-bit quantization and 32.8% with 3-bit quantization. Could you please specify your training configuration or release the training script?

xxw11 commented 7 months ago

Hi, using the default parameter settings in qalora.py should replicate the results of the paper. For details on the evaluation, please refer to the paper.

freeSoul-SNU commented 2 weeks ago

@xxw11 Hi. I checked the results without merging the LoRA adapters into the quantized parameters. I used the Llama 7B model quantized to 4 bits and fine-tuned it on the Alpaca dataset.

For the MMLU evaluation, the results were as follows:

  • Humanities: 35.2 (Paper: 36.6)
  • STEM: 32.0 (Paper: 32.4)
  • Social Science: 34.7 (Paper: 44.8)
  • Other: 39.6 (Paper: 44.9)

Regarding the default parameters in the code: I used a learning rate of 0.0002, which differs from the paper's setting of 0.00002. I set the batch size to 16 and the gradient accumulation steps to 1. I also uninstalled the Triton package, as suggested in another issue.

With these default parameters I could not replicate the performance reported in the paper. As a test, I also trained with the learning rate mentioned in the paper (0.00002), but the results got even worse. Notably, the MMLU loss increased as training progressed.

Do you have any tips on what might be causing this?

Thanks.

xxw11 commented 2 weeks ago

@freeSoul-SNU In my reproduction process, I found that using a linear learning rate, as specified in the original paper, might lead to unstable results. However, the MMLU score should be around 38-40. Your parameters are basically the same as mine; I set the batch size to 1 and the gradient accumulation step to 16, but theoretically, this shouldn't affect the final results. Did you use the MMLU evaluation repository mentioned in the QA-LoRA paper?
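For readers unfamiliar with the scheduler distinction mentioned above, here is a minimal sketch of how a linearly decaying learning rate differs from a constant one. It is plain Python, not code from qalora.py; the function names are hypothetical, and the base rate (0.0002, quoted above) and step count (10000) are only illustrative. Which schedule qalora.py actually uses by default is not stated in this thread.

```python
def linear_lr(step, max_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by linear decay to zero (illustrative helper)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


def constant_lr(step, max_steps, base_lr):
    """The learning rate stays at base_lr for the whole run."""
    return base_lr


if __name__ == "__main__":
    base_lr, max_steps = 2e-4, 10000  # illustrative values only
    for step in (0, 2500, 5000, 7500, 10000):
        print(step, linear_lr(step, max_steps, base_lr), constant_lr(step, max_steps, base_lr))
```

With a linear schedule, the later part of training runs at a much smaller step size than the nominal rate, so two runs with the same nominal learning rate but different schedules are not directly comparable.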

freeSoul-SNU commented 2 weeks ago

@xxw11 Thank you very much for your response.

For the MMLU evaluation, I used the mmlu_evaluation function in qalora.py shared by the author. The dataset wasn't included in the qalora Git repository, but I found it in the following GPTQ LoRA repository and used it: https://github.com/qwopqwop200/gptqlora

I also believe that increasing the batch size to 16 with gradient_accumulation set to 1 should not make a significant difference, but the MMLU evaluation results are still very low.
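The reasoning behind "batch size 16 with 1 accumulation step" versus "batch size 1 with 16 accumulation steps" can be checked on a toy problem. The sketch below is plain Python with a hypothetical one-parameter squared-error model, not anything from qalora.py. It assumes every micro-batch contributes the same number of loss terms and that each micro-batch loss is scaled by 1/accumulation steps.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of 0.5 * (w*x - y)^2, averaged over the batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data: 16 samples, so one "batch of 16" or sixteen "batches of 1".
xs = [float(i) for i in range(1, 17)]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

# Batch size 16, gradient accumulation 1.
full_batch_grad = grad_mse(w, xs, ys)

# Batch size 1, gradient accumulation 16: each micro-batch gradient is
# scaled by 1/accum_steps and summed before the optimizer step.
accum_steps = 16
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / accum_steps

print(full_batch_grad, accumulated_grad)
```

Under these assumptions the two gradients agree up to floating-point error. In language-model training the equivalence can be only approximate, for example when the loss is averaged over tokens and micro-batches contain different numbers of tokens.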

I used the following script for fine-tuning:

CUDA_VISIBLE_DEVICES=1 HF_DATASETS_OFFLINE=1 python qalora.py --model_path "AutoGPTQ/examples/quantization/llama7b-quant4bit-g32/" \
    --output_dir output_alpaca \
    --dataset alpaca \
    --do_eval True \
    --do_mmlu_eval True \
    --do_train True \
    --mmlu_dataset 'mmlu-fs' \
    --save_strategy 'no' \
    --save_steps 1000 \
    --max_steps 10000 \
    --optim paged_adamw_32bit

If it’s not too much trouble, could you share the MMLU evaluation script you used?

+ Also, did you change the part of qalora.py that loads the model in float32 so that it loads in float16 before training?

Thank you so much.
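As background to the float32 versus float16 question, the sketch below shows why the dtype can matter at small learning rates. It emulates the two formats with the standard-library struct module by rounding a value to half or single precision. The weight and update magnitudes are made-up illustrations, and this is not a diagnosis of the results in this thread.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the given IEEE 754 format ('e' = float16, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 1.0   # hypothetical parameter value
update = 2e-5  # hypothetical update of the same order as the paper's learning rate

fp16_result = round_to("e", weight + update)
fp32_result = round_to("f", weight + update)

print("float16:", fp16_result)  # the update is rounded away
print("float32:", fp32_result)  # the update is preserved
```

Near 1.0, float16 values are spaced about 0.001 apart, so an update of this size is lost entirely, while float32 keeps it. Whether this affects a given run depends on which tensors are trained and in which dtype.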