Hi, appreciate the amazing work! I have one question about embed-wise pooling attention: how do you determine the embedding dimensions `D_h` and `D_w`? In the code `D_h` is hardcoded as 32; is there any reason or intuition behind choosing this value? Also, have you tried other dimensions (e.g. `D_h=8`, which results in a square embedded feature if `D=64`) to see whether performance improves?
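
For reference, here is a minimal sketch of how I understand the factorization, assuming the embedding dimension `D` is reshaped into a `D_h x D_w` grid before pooling (the tensor shapes, pooling direction, and use of mean pooling here are my own illustration, not taken from your code):

```python
import torch

D = 64                       # total embedding dimension
D_h, D_w = 32, D // 32       # current factorization: 32 x 2
# D_h, D_w = 8, D // 8       # the "square" alternative: 8 x 8

x = torch.randn(4, 196, D)   # hypothetical (batch, tokens, D) input
x2d = x.view(4, 196, D_h, D_w)  # factor D into a D_h x D_w grid

# pool along each axis of the factored embedding
pooled_h = x2d.mean(dim=3)   # (batch, tokens, D_h)
pooled_w = x2d.mean(dim=2)   # (batch, tokens, D_w)
```

With `D_h=8, D_w=8` the two pooled features would have matching sizes, which is what I meant by a "square" embedded feature.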