BaohaoLiao opened 7 months ago
Did you count the classification head?
Hi @Edenzzzz, as I mentioned above, the classifier parameters are excluded. You can also check my manual calculation.
For DeBERTa-v3-base, the hidden_dim=768, intermediate_dim=3072.
Oh, sorry. I also wonder whether you were able to reproduce any of the experimental results?
Hi @Edenzzzz , I did some experiments on RoBERTa; I hope they help you. One notable thing is that BOFT consumes more GPU memory than LoRA because of the calculation of the matrix inverse.
Many thanks! I suspected that multiplying multiple butterfly matrices might be slow and memory-inefficient due to storing intermediate activations. I also wonder if this is part of your ongoing work (I'm trying to borrow/cite some reproduced BOFT results)?
Yes, it is my ongoing project for NeurIPS 2024. Unfortunately, I haven't released it.
You can use my results if you want. My experimental setting follows https://arxiv.org/pdf/2402.15179. For the reproduction of OFT and BOFT, I use the hyperparameters from the BOFT paper. Notably, I only tune the learning rate of BOFT and OFT and keep the other hyperparameters the same. To tune the learning rate, if the reported learning rate is 1e-4, I expand it to {8e-5, 1e-4, 3e-4}; if it is 3e-4, I expand it to {1e-4, 3e-4, 5e-4}. In short, the grid spans roughly a factor of 2 around the reported value.
BTW, you are right. BOFT consumes more GPU memory because it caches more activations, and it is slow because of the calculation of the matrix inverse: roughly 2x LoRA's training time.
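For context, the inverse being discussed comes from the Cayley parameterization that OFT/BOFT use to build orthogonal matrices. Here is a minimal sketch in plain PyTorch (illustrative only, not peft's actual implementation) showing where that extra linear solve enters:

```python
import torch

def cayley(q_raw: torch.Tensor) -> torch.Tensor:
    """Orthogonal matrix via the Cayley transform R = (I + Q)(I - Q)^{-1}."""
    q = q_raw.triu(1)                 # keep the strictly upper triangle
    skew = q - q.transpose(-1, -2)    # Q is skew-symmetric: Q = -Q^T
    eye = torch.eye(skew.shape[-1], dtype=skew.dtype, device=skew.device)
    # Solve (I - Q) R = (I + Q) rather than forming the inverse explicitly;
    # this solve is the extra compute/activation memory that LoRA avoids.
    return torch.linalg.solve(eye - skew, eye + skew)

r = cayley(torch.randn(8, 8, dtype=torch.float64))
print(torch.allclose(r @ r.T, torch.eye(8, dtype=torch.float64), atol=1e-8))  # True
```

Because Q is skew-symmetric, R is orthogonal by construction, which is why the parameterization needs the solve in the first place.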
Hi @BaohaoLiao , the reason you get around 1.4M is that you count all the entries of the skew-symmetric matrix (in the Cayley parameterization) as trainable parameters. However, only about half of the skew-symmetric matrix needs to be trainable, since one triangular half is the negative transpose of the other and the diagonal is zero. That's why we get around 0.8M.
Hi @wy1iu , thank you for your explanation. Do you mean that 1.4M is the number of trainable parameters during training, and 0.8M is the number of parameters to save, because of the skew-symmetric structure? I ask because I can't find a flag that restricts training to only half of the skew-symmetric matrix.
@BaohaoLiao You can simply initialize half of the matrix and then copy it, negated, to the other half. Then you only need to train half of the matrix, since the other half is just the negation of that copy.
If I remember correctly, in PyTorch it's not straightforward to enable gradients for half of a matrix and disable them for the other half.
@Edenzzzz For example, you just need to initialize a vector whose length equals half of the matrix's entries, and then assemble the matrix by reshaping this vector.
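A minimal sketch of this trick, assuming plain PyTorch (the class and names are illustrative, not peft's API): keep only the strictly upper-triangular entries as a trainable vector and assemble the full skew-symmetric matrix in the forward pass, so gradients only flow to half the entries.

```python
import torch
import torch.nn as nn

class SkewSymmetric(nn.Module):
    """Trains n*(n-1)/2 entries instead of n*n for a skew-symmetric matrix."""

    def __init__(self, n: int):
        super().__init__()
        self.n = n
        # one parameter per strictly-upper-triangular entry
        self.vec = nn.Parameter(torch.randn(n * (n - 1) // 2))

    def forward(self) -> torch.Tensor:
        q = self.vec.new_zeros(self.n, self.n)
        rows, cols = torch.triu_indices(self.n, self.n, offset=1)
        q[rows, cols] = self.vec      # fill the upper half
        return q - q.T                # lower half is the negation

m = SkewSymmetric(4)
print(sum(p.numel() for p in m.parameters()))  # 6, i.e. 4*3/2
q = m()
print(torch.equal(q, -q.T))  # True: skew-symmetric by construction
```

This sidesteps the "gradient for half a matrix" problem entirely: the matrix is never a parameter, only the vector is.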
@BaohaoLiao @wy1iu @Edenzzzz I have the same question about the number of parameters in the BOFT paper.
Hi,
thank you for this inspiring work!
I'm reproducing the reported results in the paper for the GLUE benchmark with DeBERTa-v3-base and peft. Here are my settings:
I noticed that the number of trainable parameters (the classifier parameters are excluded) is 1,410,048 for both OFT and BOFT. However, the number reported in
[Table 1](https://arxiv.org/pdf/2311.06243)
is about 0.8M. I also calculated the number of trainable parameters by hand for BOFT as: ((2 × 96 × 8 × 8 + 768) × 4 + (2 × 96 × 8 × 8 + 3072) + (2 × 384 × 8 × 8 + 768)) × 12 = 1,410,048
Could you tell me whether I did something wrong?
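For anyone checking, the hand calculation above can be reproduced with a short script. The mapping of each term to a specific module is my guess from the dimensions (block size 8, hidden 768, intermediate 3072, 12 layers), not taken from the peft source:

```python
# DeBERTa-v3-base dimensions from the thread
hidden, inter, layers, block = 768, 3072, 12, 8

def boft_module_params(scale_dim: int, num_blocks: int) -> int:
    # 2 butterfly factors, each with num_blocks blocks of size block x block,
    # plus a per-output scaling vector of length scale_dim
    return 2 * num_blocks * block * block + scale_dim

per_layer = (
    boft_module_params(hidden, 96) * 4   # presumably q, k, v, attn output (768 -> 768)
    + boft_module_params(inter, 96)      # presumably up projection (768 -> 3072)
    + boft_module_params(hidden, 384)    # presumably down projection (3072 -> 768)
)
print(per_layer * layers)  # 1410048
```

This matches the 1,410,048 figure, which supports the explanation earlier in the thread that the gap to 0.8M comes from counting the full skew-symmetric matrices rather than only their independent (triangular) halves.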