Closed jongjyh closed 2 years ago
Hi,
I haven't run experiments on BERT before. Do you have the performance of the BERT model on MRPC with full fine-tuning (updating all parameters)? Also, parameter-efficient training usually needs larger learning rates than full fine-tuning; a 10x larger learning rate is usually a good starting point.
BERT usually gets 85% accuracy or higher on MRPC.
Hey, I'm also curious about the memory usage. I reproduced your experiment with T5-base on MRPC and found 8.8 GB of memory usage according to the command nvidia-smi. However, the paper reports only 5.5 GB. Could you explain this? Thanks!
Then maybe try a larger learning rate first.
The memory cost may vary depending on the model architecture. What's the memory usage of full fine-tuning on BERT? LST should use less memory than that.
Also, please share more details and concrete observations if you want me to help, like what you have tried and what you suspect the issues could be. It's hard to give useful suggestions when I only know the final accuracy and memory cost.
Actually, in Table 1, T5-base + LST is trained with 3 layers dropped in each of the encoder and decoder, which reduces the memory from 7 GB (without dropping) to 5.5 GB.
Reference, from page 7 of the paper: "We drop 6 layers (3 layers each in the encoder and decoder) for LST to match the parameter usage of Adapter and LoRA."
Thank you, I think I made a mistake! Everything is fine now.
Good to know!
How do you monitor memory usage? Could you offer the script?
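In case it helps, here is a minimal sketch of one way to track peak GPU memory during a training run by polling nvidia-smi from Python. The function names, polling interval, and duration below are illustrative, not from the paper's codebase:

```python
import subprocess
import time


def parse_used_mib(line):
    # `nvidia-smi --query-gpu=memory.used --format=csv,noheader`
    # prints one line per GPU, e.g. "8843 MiB"; keep the number only.
    return int(line.strip().split()[0])


def peak_memory_mib(interval=0.5, duration=60.0):
    """Poll nvidia-smi and return the peak used memory (MiB) seen on any GPU."""
    peak = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
            text=True,
        )
        peak = max(peak, max(parse_used_mib(l) for l in out.splitlines()))
        time.sleep(interval)
    return peak
```

Alternatively, inside a PyTorch script, torch.cuda.max_memory_allocated() reports the peak memory allocated by tensors; note this can be lower than what nvidia-smi shows, since the CUDA context and PyTorch's caching allocator add overhead on top of tensor allocations.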
Hi, this work is great! Are there any GLUE experiment results on BERT? I reproduced LST with BERT on MRPC and got an accuracy of 75%. Is that normal? Thanks! :)