Closed · EllaHxyz closed this issue 11 months ago
Hi @EllaHxyz, yes, you are right: we just treat M·P as the feature channel, so there are M·P features now. This is then used directly to predict the output (B, M, T) via the transformer and a linear head.
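A minimal PyTorch sketch of the channel-mixing pipeline as I understand it from the description above; all variable names, sizes, and layer choices here are my own illustrative assumptions, not the authors' actual code:

```python
import torch
import torch.nn as nn

B, M, P, N = 8, 7, 16, 12   # batch, variables, patch length, number of patches (assumed sizes)
D, T = 64, 96               # model dimension, prediction horizon (assumed sizes)

x = torch.randn(B, M, P, N)

# Treat M*P as the feature channels: (B, M, P, N) -> (B, M*P, N)
x = x.reshape(B, M * P, N)

# Embed the M*P channels into D model dimensions at each patch position:
# (B, M*P, N) -> (B, N, M*P) -> (B, N, D)
tokens = nn.Linear(M * P, D)(x.transpose(1, 2))

# Transformer encoder over the N patch positions (batch_first layout).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
z = encoder(tokens)  # (B, N, D)

# Flatten the last two dimensions and map to the (B, M, T) forecast.
head = nn.Linear(N * D, M * T)
out = head(z.reshape(B, N * D)).reshape(B, M, T)
print(out.shape)  # torch.Size([8, 7, 96])
```

The encoder output here is (B, N, D) rather than (B, D, N) because of `batch_first=True`; either layout works as long as the flatten before the linear head is consistent.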
With an output of [B,D,N] from the transformer encoder, how did you reshape it to [B,M,T]? Could you give more details on the implementation after the transformer? Thanks!
Hi, you can flatten the last two dimensions and use a linear layer to do it.
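A short sketch of the reply above, assuming the sizes are illustrative (this is not the authors' code): flatten [B, D, N] into [B, D·N], project to M·T with one linear layer, then reshape to [B, M, T].

```python
import torch
import torch.nn as nn

B, D, N, M, T = 8, 64, 12, 7, 96      # assumed sizes for illustration
enc_out = torch.randn(B, D, N)        # transformer encoder output [B, D, N]

# Flatten the last two dimensions, then a single linear head to M*T.
head = nn.Linear(D * N, M * T)
forecast = head(enc_out.reshape(B, D * N)).reshape(B, M, T)
print(forecast.shape)  # torch.Size([8, 7, 96])
```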
Hi, may I ask how you implement the channel-mixing model for the comparison (Figure 7)? The paper mentions reshaping (B, M, P, N) to (B, MP, N), but it's not very clear what comes next. Did you feed the (B, MP, N) tensor to a projection/embedding to produce (B, D, N)? How did you shape it back to (B, M, D, N) after the transformer encoder? If the code is present in the repository and I overlooked it, please kindly advise me where to find it. Thank you!