Cannot reproduce the result of LoftQ on gsm8k with llama2-7b

Hi,

I used the following command to evaluate LoftQ on GSM8K:

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16

The final accuracy is only about 0.30 (the log prints it as 0.30%). The output is:

prediction [18.0, 4.0, 68000.0, 540.0, 3.5, 128.0, 260.0, 32.0, 500.0, 412.0, 366.0, 8184.0, 83.0, 8.0, 5.0, 3835.0]
ground truth [18.0, 3.0, 70000.0, 540.0, 20.0, 64.0, 260.0, 160.0, 45.0, 460.0, 366.0, 694.0, 13.0, 18.0, 60.0, 125.0, 230.0, 57500.0, 7.0, 6.0, 15.0, 14.0, 7.0, 8.0, 26.0, 2.0, 243.0, 16.0, 25.0, 104.0, 109.0, 80.0, 35.0, 70.0, 23.0, 9.0, 75.0, 2.0, 10.0, 18.0, 8.0, 200.0, 26.0, 48.0, 20.0, 104.0, 163.0, 800.0, 8.0, 30.0, 294.0, 5.0, 15.0, 40.0, 40.0, 14.0, 3.0, 83.0, 57.0, 187.0, 17.0, 1430.0, 25000.0, 1596.0, 300.0, 36.0, 48.0, 595.0, 36.0, 60.0, 7425.0, 60.0, 221.0, 255.0, 88.0, 60.0, 5.0, 100.0, 6.0, 70.0, 10.0, 17.0, 623.0, 600.0, 15.0, 44.0, 22.0, 9360.0, 8000.0, 24.0, 225.0, 28.0, 4.0, 36.0, 348.0, 40.0, 3.0, 12.0, 5.0, 58.0, 175.0, 6.0, 26.0, 140.0, 500.0, 20.0, 72.0, 3.0, 50.0, 28.0, 45.0, 16.0, 24.0, 25.0, 6.0, 90.0, 42.0, 360.0, 4.0, 95200.0, 240.0, 27.0, 48.0, 50.0, 10.0, 10.0, 82.0, 120.0, 880.0, 10000.0, 30.0, 940.0, 60.0, 13.0, 720.0, 40.0, 6.0, 29.0, 105.0, 70.0, 20.0, 400.0, 140.0, 16.0, 20.0, 4000.0, 2125.0, 75.0, 30.0, 16.0, 4.0, 5.0, 4.0, 48.0, 272.0, 280.0, 1400.0, 80.0, 34.0, 15.0, 16.0, 32.0, 92.0, 50.0, 15.0, 77.0, 5.0, 16.0, 18.0, 120.0, 150.0, 1210.0, 51.0, 18000.0, 95.0, 15.0, 100.0, 350.0, 122.0, 130.0, 20.0, 160.0, 23.0, 2.0, 25.0, 30.0, 5.0, 106.0, 50.0, 34.0, 360.0, 5.0, 91.0, 24.0, 10.0, 12.0, 120.0, 6277.0, 320.0, 7500.0, 55.0, 114200.0, 100.0, 31.0, 98.0, 98.0, 860.0, 2600.0, 76.0, 145.0, 10.0, 4.0, 5.0, 250.0, 8.0, 44.0, 220.0, 15.0, 45.0, 54.0, 70.0,...]
adapter: None | GSM8K test accuracy: 0.30% | full precision: False
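
To sanity-check the numbers above, here is a minimal sketch that recomputes the accuracy for the one batch of 16 predictions shown in the log. It assumes that the predictions line up with the first 16 ground-truth values and that accuracy means the fraction of exact numeric matches; the helper name `accuracy` and the tolerance are my own and are not taken from test_gsm8k.py, which may compute the metric differently.

```python
def accuracy(predictions, ground_truth, tol=1e-4):
    """Fraction of predictions that match the ground truth within `tol`.

    Hypothetical helper for illustration; not the function used by test_gsm8k.py.
    """
    pairs = list(zip(predictions, ground_truth))
    correct = sum(abs(p - g) < tol for p, g in pairs)
    return correct / len(pairs)


# The 16 predictions printed in the log above.
prediction = [18.0, 4.0, 68000.0, 540.0, 3.5, 128.0, 260.0, 32.0,
              500.0, 412.0, 366.0, 8184.0, 83.0, 8.0, 5.0, 3835.0]

# The first 16 ground-truth values from the log (assumed to be aligned
# with the predictions above).
ground_truth = [18.0, 3.0, 70000.0, 540.0, 20.0, 64.0, 260.0, 160.0,
                45.0, 460.0, 366.0, 694.0, 13.0, 18.0, 60.0, 125.0]

batch_acc = accuracy(prediction, ground_truth)
print(f"{batch_acc:.2f} as a fraction, {batch_acc:.2%} as a percentage")
# prints "0.25 as a fraction, 25.00% as a percentage"
```

Under these assumptions, 4 of the 16 answers in this batch match, which is in the same range as the overall 0.30 if that number is a fraction rather than a percentage.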

Is there a special setting I need to pay attention to in order to reproduce the reported result?

Best

yxli2123 / LoftQ, issue #32