Hi! The adapter was one of our initial trials. If I remember correctly, adapter tuning and prefix tuning gave us very similar results. Due to computational constraints, we just chose one to run all experiments on. Since this is not officially reported in the paper (and it has been a long time since then), we cannot provide further support for the adapter tuning :(
@ChenWu98 Thank you for your patient explanation and reply. I would also like to ask whether you have ever counted the number of fine-tuned parameters for unified_skg. Some people count only the parameters introduced by the extra structure (such as the knowledge_trans function) as the prefix-tuning parameter count.
However, I think it should be the number of parameters that require gradient training divided by the total number of parameters in the model.
Take the prefixtuning.py file as an example: the highlighted part also requires gradient computation, so I think it belongs to prefix tuning and should be included when counting the fine-tuned parameters. What do you think? (See the sketch at the end of this message for the counting I have in mind.)
Looking forward to your reply, and thank you again.
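To make the counting concrete, here is a minimal sketch in plain PyTorch (the `model` here is just any model with the prefix module attached and the backbone frozen, not the exact UnifiedSKG code):

```python
import torch

def trainable_parameter_ratio(model: torch.nn.Module) -> float:
    """Parameters that require gradients divided by the global total."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Hypothetical usage after freezing the backbone and attaching the prefix module:
# ratio = trainable_parameter_ratio(model)
# print(f"fine-tuned fraction: {ratio:.4%}")
```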
It depends. At inference, the prefix values (that is, the output of the part you highlighted) can be pre-computed, so it is fair to count only the prefix values instead of all the overparameterized weights; during training, however, this overparameterization does incur additional computational cost. So if we care about representation capability, we may count only the prefix values; if we care about memory, etc., we may count all the actual weights. Just my personal take!
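To illustrate the distinction, a minimal sketch (the class name and dimensions below are illustrative, not the exact prefixtuning.py code): during training the gradient flows through the overparameterized MLP, while at inference its output can be computed once and cached, after which the MLP weights are no longer needed.

```python
import torch
import torch.nn as nn

class PrefixReparameterization(nn.Module):
    """Illustrative prefix module: a small learned embedding is expanded by an
    MLP into the prefix key/value activations (dimensions are made up)."""
    def __init__(self, prefix_len=10, emb_dim=512, mid_dim=800, kv_dim=2 * 12 * 64):
        super().__init__()
        self.prefix_tokens = nn.Parameter(torch.randn(prefix_len, emb_dim))
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, mid_dim),
            nn.Tanh(),
            nn.Linear(mid_dim, kv_dim),
        )

    def forward(self) -> torch.Tensor:
        # Training: gradients flow through both prefix_tokens and the MLP,
        # so all of these weights cost memory during fine-tuning.
        return self.mlp(self.prefix_tokens)

prefix_module = PrefixReparameterization()
# Inference: pre-compute the prefix values once; only this tensor is needed
# afterwards, so one may count just these values for representation capability.
with torch.no_grad():
    cached_prefix_values = prefix_module()
```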
I found differences between the paper and the code of Unified-SKG. The paper does not mention any experiments or results related to adapter tuning, but I discovered adapter tuning code in the codebase. Moreover, this adapter tuning code carries 10 virtual tokens during tuning. Why is that?