wy1iu / butterfly-oft

Official implementation of "Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization"

number of trainable parameters doesn't match the paper #2

Open BaohaoLiao opened 2 months ago

BaohaoLiao commented 2 months ago

Hi,

thank you for this inspiring work!

I'm reproducing the results reported in the paper for the GLUE benchmark with DeBERTa-v3-base and peft. Here are my settings:

from peft import BOFTConfig

## For OFT with b=16
peft_config = BOFTConfig(
                task_type="SEQ_CLS",
                inference_mode=False,
                boft_block_size=16,
                boft_n_butterfly_factor=1,
                target_modules=["query_proj", "key_proj", "value_proj", "output.dense", "intermediate.dense"],
                boft_dropout=0.1,
                init_weights=True,
            )

## For BOFT with b=8 and m=2
peft_config = BOFTConfig(
                task_type="SEQ_CLS",
                inference_mode=False,
                boft_block_size=8,
                boft_n_butterfly_factor=2,
                target_modules=["query_proj", "key_proj", "value_proj", "output.dense", "intermediate.dense"],
                boft_dropout=0.1,
                init_weights=True,
            )
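
For reference, here is a minimal sketch of how I count the trainable parameters (the checkpoint name and the name-based exclusion of the classifier/pooler are my own choices, not something prescribed by the paper):

```python
from transformers import AutoModelForSequenceClassification
from peft import get_peft_model

# Assumed base checkpoint for DeBERTa-v3-base; peft_config is one of the configs above.
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-base", num_labels=2)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # this count still includes the classification head

# Count again while excluding the classification head (and pooler) by parameter name.
n_trainable = sum(
    p.numel()
    for name, p in model.named_parameters()
    if p.requires_grad and "classifier" not in name and "pooler" not in name
)
print(n_trainable)  # matches the 1,410,048 reported below
```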

I noticed that the number of trainable parameters (with the classifier parameters excluded) is 1,410,048 for both OFT (b=16) and BOFT (b=8, m=2). However, the reported number in [Table 1](https://arxiv.org/pdf/2311.06243) is about 0.8M.

I also calculated the number of trainable parameters by hand for BOFT: ((2 * 96 * 8 * 8 + 768) * 4 + (2 * 96 * 8 * 8 + 3072) + (2 * 384 * 8 * 8 + 768)) * 12 = 1,410,048.
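
Spelled out per layer (the mapping of each term to a module is my reading of peft's BOFT parameter shapes, i.e. m butterfly factors x (in_features / b) blocks of b x b entries, plus an out_features-sized scaling vector):

```python
# b = 8, m = 2 butterfly factors; DeBERTa-v3-base has hidden = 768, intermediate = 3072, 12 layers.
m, b = 2, 8
attn_768_to_768 = m * (768 // b) * b * b + 768     # query_proj / key_proj / value_proj / attention output.dense
up_768_to_3072 = m * (768 // b) * b * b + 3072     # intermediate.dense
down_3072_to_768 = m * (3072 // b) * b * b + 768   # FFN output.dense
per_layer = 4 * attn_768_to_768 + up_768_to_3072 + down_3072_to_768
print(12 * per_layer)  # 1410048
```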

Could you tell me whether I did something wrong?

Edenzzzz commented 2 months ago

Did you count the classification head?

BaohaoLiao commented 2 months ago

Hi @Edenzzzz, as I mentioned above, the classifier parameters are already excluded. You can also check my manual calculation.

For DeBERTa-v3-base, the hidden_dim=768, intermediate_dim=3072.

Edenzzzz commented 2 months ago

Oh, sorry. I also wonder whether you were able to reproduce any of the experimental results?

BaohaoLiao commented 2 months ago

Hi @Edenzzzz, I did some experiments on RoBERTa; I hope they help. One notable thing is that BOFT consumes more GPU memory than LoRA because of the matrix-inverse computation.

[Screenshot of RoBERTa results, 2024-05-04]

Edenzzzz commented 2 months ago

Many thanks! I suspected that multiplying multiple butterfly matrices might be slow and memory-inefficient due to storing activations. I also wonder whether this is for your ongoing work (I'm trying to borrow/cite some reproduced BOFT results)?

BaohaoLiao commented 2 months ago

Yes, it is my ongoing project for NeurIPS 2024. Unfortunately, I haven't released it yet.

You can use my results if you want. My experimental setup follows https://arxiv.org/pdf/2402.15179. For the reproduction of OFT and BOFT, I use the hyper-parameters from the BOFT paper. Notably, I only tune the learning rate of BOFT and OFT and keep the other hyper-parameters the same. To tune the learning rate: if the reported learning rate is 1e-4, I expand it to {8e-5, 1e-4, 3e-4}; if it is 3e-4, I expand it to {1e-4, 3e-4, 5e-4}. In short, neighboring candidates differ by roughly a factor of 2 to 3.

BaohaoLiao commented 2 months ago

BTW, you are right: BOFT consumes more GPU memory because it caches more activations. It is slow because of the matrix-inverse computation, taking roughly 2x LoRA's training time.

wy1iu commented 2 months ago

Hi @BaohaoLiao, the reason you get around 1.4M is that you count all the entries of the skew-symmetric matrix (in the Cayley parameterization) as trainable parameters. However, only about half of the skew-symmetric matrix needs to be trainable, since the lower triangle is just the negated copy of the upper triangle (and the diagonal is zero). That's why we get around 0.8M.
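
For example, with b = 8 each block stores 8 x 8 = 64 entries, but only the entries strictly above the zero diagonal are independent:

```python
b = 8
dense_entries = b * b            # 64 entries stored in the full skew-symmetric block
free_entries = b * (b - 1) // 2  # 28 independent entries above the diagonal
print(dense_entries, free_entries)  # 64 28 -> roughly half of the block is actually trainable
```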

BaohaoLiao commented 2 months ago

Hi @wy1iu, thank you for your explanation. Do you mean that 1.4M is the number of trainable parameters during training and 0.8M is the number of parameters to save, thanks to the skew-symmetric structure? I ask because I can't find a flag that restricts training to only half of the skew-symmetric matrix.

wy1iu commented 2 months ago

@BaohaoLiao You can simply initialize half of the matrix and then copy it to the other half (don't forget the negation). Then you only need to train half of the matrix, since the other half is just a negated copy.

Edenzzzz commented 2 months ago

If I remember correctly, in PyTorch it's not straightforward to enable gradients for half of a matrix while disabling them for the other half.

wy1iu commented 1 month ago

@Edenzzzz For example, you just need to initialize a vector whose dimension equals the number of free entries (half of the matrix), and then assemble the matrix by reshaping and mirroring this vector.
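
A minimal PyTorch sketch of this idea (illustrative only, not the exact code in this repo): keep a trainable vector of length b(b-1)/2, scatter it into the upper triangle, subtract its transpose to get the skew-symmetric matrix, and map it to an orthogonal block with the Cayley transform. Gradients then flow only into the vector, which also sidesteps the concern about enabling gradients for only half of a matrix.

```python
import torch
import torch.nn as nn


class CayleyOrthogonalBlock(nn.Module):
    """Orthogonal b x b block parameterized by the upper triangle of a skew-symmetric matrix."""

    def __init__(self, b: int):
        super().__init__()
        self.b = b
        # Only b * (b - 1) / 2 trainable parameters; zero init gives Q = 0 and hence R = I.
        self.theta = nn.Parameter(torch.zeros(b * (b - 1) // 2))

    def forward(self) -> torch.Tensor:
        b = self.b
        rows, cols = torch.triu_indices(b, b, offset=1)
        Q = self.theta.new_zeros(b, b)
        Q[rows, cols] = self.theta  # fill the upper triangle from the trainable vector
        Q = Q - Q.T                 # the lower triangle is the negated copy; the diagonal stays zero
        I = torch.eye(b, dtype=Q.dtype, device=Q.device)
        # Cayley transform: R = (I + Q)^{-1} (I - Q) is orthogonal for any skew-symmetric Q.
        return torch.linalg.solve(I + Q, I - Q)


block = CayleyOrthogonalBlock(8)
R = block()
print(torch.allclose(R @ R.T, torch.eye(8), atol=1e-6))  # True: R is orthogonal
print(sum(p.numel() for p in block.parameters()))        # 28 trainable parameters for b = 8
```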