Closed: TiankaiHang closed this issue 3 years ago
Hi, thanks for your nice work!
I have one question. Here you introduce the linear projection W_A. The number of parameters of W_A (r × H × H) depends on the expansion ratio r. However, in your ablation study (Table 1), the reported parameter count does not seem to change with the ratio. That confuses me... Can I ask why?
Hi Tiankai,
Thanks for your interest in this work! For your question, the number of parameters in the linear expansion/projection layer is calculated as #Heads^2 × #Expansion. As we typically use 12 heads, the total parameter overhead with an expansion ratio of 6 is 144 × 6 = 864. Thus, compared with the total number of parameters in the model, the overhead is negligible. If we report two more decimal places, it shows up as a difference of less than 0.1M.
I hope this clarifies your question. If you need any other clarifications, do drop me a message.
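To make the arithmetic concrete, here is a minimal Python sketch of the count described above. The function name `projection_overhead` is hypothetical and is not taken from the repository; it simply evaluates the formula #Heads^2 × #Expansion as stated in this reply, and it does not account for bias terms or for how many layers use the projection.

```python
def projection_overhead(num_heads: int, expansion_ratio: int) -> int:
    """Parameter count from the formula in this thread: #Heads^2 * #Expansion.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    return num_heads ** 2 * expansion_ratio


# Values quoted in the reply: 12 heads, expansion ratio of 6.
overhead = projection_overhead(num_heads=12, expansion_ratio=6)
overhead_in_millions = overhead / 1e6

print(overhead)                        # raw parameter count
print(f"{overhead_in_millions:.6f}M")  # the same count expressed in millions
print(f"{overhead_in_millions:.2f}M")  # rounded to two decimals, as in a typical table
```

With these values the count is 864, which rounds to 0.00M at two decimal places. That is consistent with the parameter column in the table looking unchanged across expansion ratios.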
Thanks for your kind reply :-)