xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

Help wanted! What is the purpose of adapter tuning in the code? Why does the adapter need to introduce a certain number of virtual tokens? #43

Closed. kanseaveg closed this issue 7 months ago.

kanseaveg commented 11 months ago

I found a difference between the paper and the code of UnifiedSKG. The paper does not mention any experiments or results related to adapter tuning, but I found adapter tuning code in the codebase. Moreover, this adapter tuning code introduces 10 virtual tokens for tuning. Why is that?

ChenWu98 commented 11 months ago

Hi! The adapter was one of our initial trials. If I remember correctly, adapter tuning and prefix tuning gave very similar results. Due to computation constraints, we chose just one of them to run all experiments with. Since adapter tuning is not officially reported in the paper (and it has been a long time since then), we cannot provide further support for it :(
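
For context on what is being compared here: "adapter tuning" generally means inserting small trainable bottleneck modules into an otherwise frozen transformer. The sketch below shows that general idea only; it is not the UnifiedSKG implementation (which, as the question notes, also prepends virtual tokens), and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. Only these weights are trained; the backbone
    transformer stays frozen."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the frozen layer's output usable at init.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```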

kanseaveg commented 10 months ago

@ChenWu98 Thank you for your patient explanation and reply. I would also like to ask whether you have ever counted the number of fine-tuned parameters for UnifiedSKG. Some people count only the parameters introduced by the prefix structure (such as the knowledge_trans function) as the prefix-tuning parameter count. However, I think it should be the number of parameters that require gradient training divided by the total number of parameters. Take prefixtuning.py as an example: the part highlighted in the screenshot requires gradient computation, so I think it is also part of prefix tuning and should be included when counting the fine-tuned parameters. What do you think?

Looking forward to your reply, and thank you again.
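
For reference, a minimal sketch of the count described above, i.e. the parameters that receive gradients divided by the total parameter count, assuming a standard PyTorch module (the usage names in the comments are hypothetical):

```python
import torch.nn as nn

def trainable_parameter_ratio(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients during fine-tuning."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Hypothetical usage: with a frozen backbone and a trainable prefix module,
# this counts both the prefix embeddings and the reparameterization weights
# (e.g., those inside knowledge_trans), since both require gradients.
# print(f"Trainable fraction: {trainable_parameter_ratio(model):.2%}")
```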

ChenWu98 commented 10 months ago

It depends. At inference, the prefix values (that is, the output of what you highlighted) can be pre-computed, so it is fair to count only the prefix values rather than all the over-parameterized weights. During training, however, this over-parameterization does incur additional computational cost. So if we care about representation capability, we may count only the prefix values; if we care about memory, etc., we may count all the actual weights. Just my personal take!
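
A minimal sketch of the distinction drawn here: during training, the prefix key/values come from an over-parameterized embedding plus MLP, but at inference that pipeline has a fixed input, so its output can be computed once and cached. The class, argument names, and shapes below are illustrative assumptions, not the repo's API:

```python
import torch
import torch.nn as nn

class PrefixGenerator(nn.Module):
    """Training-time over-parameterization: prefix indices -> embedding -> MLP."""

    def __init__(self, prefix_len: int, hidden_dim: int, mid_dim: int,
                 n_layers: int, n_heads: int, head_dim: int):
        super().__init__()
        self.register_buffer("prefix_ids", torch.arange(prefix_len))
        self.embedding = nn.Embedding(prefix_len, hidden_dim)
        # The MLP (a knowledge_trans-style reparameterization) holds most of the
        # trainable weights, but exists only to produce the prefix key/values.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mid_dim),
            nn.Tanh(),
            nn.Linear(mid_dim, n_layers * 2 * n_heads * head_dim),
        )

    def forward(self) -> torch.Tensor:
        # Shape: (prefix_len, n_layers * 2 * n_heads * head_dim)
        return self.mlp(self.embedding(self.prefix_ids))


# At inference the input is constant, so the prefix values can be computed once
# and cached; the embedding and MLP weights are then no longer needed:
# generator = PrefixGenerator(prefix_len=10, hidden_dim=512, mid_dim=512,
#                             n_layers=6, n_heads=8, head_dim=64)
# with torch.no_grad():
#     cached_prefix = generator()  # only these values matter at inference time
```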