xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

Regarding tuning the prefixes #32

Closed: base-y closed this issue 2 years ago

base-y commented 2 years ago

@Timothyxxx Hi, I have a few queries regarding prefix tuning. Could you share your opinions on them, since you have worked extensively with it (the GitHub repo of prefix tuning seems abandoned)?

  1. Will training the whole model along with the prefixes yield better performance (assuming it is not a low-data regime, i.e., more than 1k data points)?
  2. Assume we train the prefixes in two cases: with the model frozen and with the model not frozen. Will the learned prefix representations be the same in both cases, or different?
  3. Suppose I train the model + prefixes in two scenarios: (i) train the model and prefixes together (single-stage finetuning), and (ii) finetune the model first, freeze it, and then train only the prefixes on top of the frozen weights (two stages). Is there any difference between the two (I think there is), and if so, why? (A rough sketch of what I mean is below.)
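
For concreteness, here is a minimal sketch of the two scenarios; `model` and `prefix` below are just toy placeholders, not the actual UnifiedSKG classes:

```python
import torch
from torch import nn

# Placeholder modules, not the actual UnifiedSKG classes: a toy "LM" and a toy
# learnable "prefix", just to make the two training setups concrete.
model = nn.Linear(8, 8)                      # stands in for the pretrained LM
prefix = nn.Embedding(10, 8)                 # stands in for the prefix module

# (i) Single stage: train the LM and the prefixes together.
opt_joint = torch.optim.AdamW(
    list(model.parameters()) + list(prefix.parameters()), lr=5e-5
)

# (ii) Two stages: finetune the LM alone, then freeze it and train only the prefixes.
opt_stage1 = torch.optim.AdamW(model.parameters(), lr=5e-5)   # stage 1: LM only
# ... run the usual finetuning loop with opt_stage1 here ...
for p in model.parameters():
    p.requires_grad = False                                   # freeze the LM
opt_stage2 = torch.optim.AdamW(prefix.parameters(), lr=5e-5)  # stage 2: prefixes only
```
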
Timothyxxx commented 2 years ago

Hi, thanks for asking!

  1. It has generally been observed that prompt methods work better when we only have a little training data, but we have no clear idea of how many examples are enough for finetuning to catch up.
  2. No, of course they will not be the same.
  3. To be honest, we tried all of your ideas before, and our results showed (at least on SKG tasks) that setting (i) and setting (ii) achieve very similar performance. You can try it yourself again to check whether we missed anything.

Hope this information is helpful!

Thanks!

base-y commented 2 years ago

@Timothyxxx thank you. But for Q2, why would the final prefix representations not be the same? Is it because the gradients are different when the model is frozen versus not frozen? Also, for Q1, I have seen people speak extensively about the effectiveness of prompts in PROMPT TUNING (that they work well with larger models and in the low-data regime), but I haven't seen many people discuss the effectiveness of PREFIX TUNING in the same way.

Timothyxxx commented 2 years ago
  1. It's easy to understand. Just imagine the first step, when the loss is backpropagated into the LM and the prefix module: if you leave the LM unfrozen, both will be updated, but if you freeze the LM, only the prefix module will change. (See the sketch after this list.)

  2. The effectiveness in low-resource settings is discussed in the prefix-tuning paper; you can refer to that.
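
To make point 1 concrete, here is a tiny toy sketch (not our actual prefix-tuning code): a one-layer "LM" with a learnable "prefix" vector, trained for a few steps with the LM frozen vs. unfrozen. The prefix gradients match only at the very first step, because afterwards the unfrozen LM's weights have moved.

```python
import copy
import torch

torch.manual_seed(0)

# Toy illustration of the gradient-flow argument above (not the paper's actual
# prefix-tuning code): the "LM" is one linear layer and the "prefix" is a single
# learnable vector concatenated to the input.
lm = torch.nn.Linear(4, 1)
prefix = torch.nn.Parameter(torch.zeros(2))
lm_frozen = copy.deepcopy(lm)
prefix_frozen = torch.nn.Parameter(prefix.detach().clone())

def step(lm, prefix, freeze_lm, x, y, lr=0.1):
    pred = lm(torch.cat([prefix, x]))      # prefix and input both pass through the LM
    loss = (pred - y).pow(2).mean()
    grads = torch.autograd.grad(loss, [prefix] + list(lm.parameters()))
    with torch.no_grad():
        prefix -= lr * grads[0]            # the prefix is updated in both regimes
        if not freeze_lm:                  # the LM is updated only when unfrozen
            for p, g in zip(lm.parameters(), grads[1:]):
                p -= lr * g
    return grads[0].clone()

x, y = torch.randn(2), torch.randn(1)
for t in range(3):
    g_unfrozen = step(lm, prefix, freeze_lm=False, x=x, y=y)
    g_frozen = step(lm_frozen, prefix_frozen, freeze_lm=True, x=x, y=y)
    # Identical at step 0; different afterwards, because the unfrozen LM has moved.
    print(f"step {t}: prefix grads equal = {torch.allclose(g_unfrozen, g_frozen)}")
```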

base-y commented 2 years ago

@Timothyxxx Thank you for the answers. For answer 1, I understand the model's performance will differ because the LM is updated (in the unfrozen case) when the loss is backpropagated. But I think the gradient flowing into the prefix representation (the learnable prefix parameters) is the same in both cases (model frozen/unfrozen). If so, even though the overall model performance might differ, the final prefix representations would be the same, I guess (since they are updated with the same gradient at every step regardless of whether the model is frozen or not)?

Timothyxxx commented 2 years ago

I still don't follow; I recommend trying it yourself and comparing the weights of the two LMs (frozen vs. unfrozen).
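
For example, a rough sketch of that comparison, assuming each run saved its learned prefix weights as a state dict (the file names below are only placeholders):

```python
import torch

# Hypothetical paths: assume each run saved its prefix weights with torch.save().
frozen = torch.load("run_frozen_lm/prefix_state_dict.pt")
unfrozen = torch.load("run_unfrozen_lm/prefix_state_dict.pt")

for name in frozen:
    same = torch.allclose(frozen[name], unfrozen[name], atol=1e-6)
    max_diff = (frozen[name] - unfrozen[name]).abs().max().item()
    print(f"{name}: identical={same}, max |diff|={max_diff:.3e}")
```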