yikangshen / MoA

Mixture of Attention Heads

Inconsistency between the paper and the code #2

Open RobertCsordas opened 1 year ago

RobertCsordas commented 1 year ago

Hi,

I noticed that there is an inconsistency between Eq. 9 in the paper (https://arxiv.org/pdf/2210.05144.pdf) and the code at https://github.com/yikangshen/MoA/blob/master/moa_layer/parallel_linear/moe.py#L124C8-L125C69. Could you please clarify which version was used to produce the numbers in the paper?

Thank you, Robert

yikangshen commented 6 months ago

Hi Robert,

Normalization can be considered an optional feature. In my experience, using normalization could result in slightly better performance when k=2.
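For readers following along, here is a minimal sketch of what such an optional normalization of top-k gate weights typically looks like in a mixture-of-experts router. The function name, arguments, and the `renormalize` flag are illustrative only, not the actual code in `moe.py`; with `renormalize=False` the raw softmax probabilities are used, as in a plain softmax-over-all-experts formulation.

```python
import torch

def top_k_gating(logits: torch.Tensor, k: int = 2, renormalize: bool = True):
    """Select k experts per token and return their mixture weights.

    renormalize=True rescales the selected top-k softmax weights so they
    sum to 1 (the optional normalization discussed above); renormalize=False
    keeps the unnormalized softmax probabilities.
    """
    probs = torch.softmax(logits, dim=-1)             # (batch, num_experts)
    top_k_probs, top_k_idx = probs.topk(k, dim=-1)    # (batch, k)
    if renormalize:
        # Renormalize the selected weights so they sum to 1 per token.
        top_k_probs = top_k_probs / (top_k_probs.sum(dim=-1, keepdim=True) + 1e-9)
    return top_k_probs, top_k_idx
```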

Regards, Yikang