Open yandun72 opened 1 year ago
Hi @yandun72. If the shape of x is (batch, seq, hidden_size), you can permute it to (seq, batch, hidden_size) or set batch_first=True.
Sorry that the description of cross-attention confused you. In BatchFormer, we apply attention across the batch dimension, so the cross-attention is not a separate attention mechanism but ordinary Transformer attention; we just want to emphasize the batch dimension. You can think of it as cross-batch attention.
Regards,
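One way to read the reply is the following sketch (my own illustrative code, not the official BatchFormer implementation; the sizes batch, seq, hidden_size, n_head are made-up examples): with batch_first=True, nn.TransformerEncoderLayer attends over dim 1, so permuting x to (seq, batch, hidden_size) puts the batch axis in the attended position, giving cross-batch attention for a sequence input.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the thread).
batch, seq, hidden_size, n_head = 8, 16, 32, 4

# With batch_first=True, nn.TransformerEncoderLayer attends over dim 1.
# Feeding x permuted to (seq, batch, hidden_size) therefore makes the
# *batch* axis the attended one -- i.e. cross-batch attention.
layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=n_head,
    dim_feedforward=hidden_size, dropout=0.5,
    batch_first=True,
)
layer.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(batch, seq, hidden_size)
out = layer(x.permute(1, 0, 2))   # (seq, batch, hidden): attention across batch
out = out.permute(1, 0, 2)        # back to (batch, seq, hidden_size)
assert out.shape == (batch, seq, hidden_size)
```

Each sequence position then mixes information with the same position of the other samples in the batch; whether to share or restrict this across positions is a design choice.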
Thanks for your reply! I've got it!
Hi, I know that in TransformerEncoderLayer(C, 4, C, 0.5) the arguments C, 4, C mean d_model, n_head, and dim_feedforward,
and that x.unsqueeze(1) gives a tensor of shape (N, 1, C).
Because batch_first is False for the Transformer, it does self-attention over the batch dimension, but I am confused by what you call cross attention in the paper. I can't find the cross attention in the pseudo code; can you give me an interpretation of it? Also, what if x has shape (batch, seq, hidden_size)? That is the shape for an NER task. How should BatchFormer be applied in that situation? Hoping for your sincere reply!
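For reference, the mechanism the question describes can be sketched as runnable code (my reading of the paper's pseudo code, not the official repo; the sizes N, C and the label setup are illustrative assumptions): x.unsqueeze(1) gives (N, 1, C), and with the default batch_first=False the encoder attends over dim 0, which here is the batch axis N.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the thread).
N, C = 8, 32

# d_model=C, n_head=4, dim_feedforward=C, dropout=0.5, as in the question.
encoder = nn.TransformerEncoderLayer(C, 4, C, 0.5)
encoder.eval()  # disable dropout for a deterministic forward pass

def batchformer(x, y, encoder, is_training=True):
    """Sketch of the BatchFormer pseudo code for (N, C) features."""
    if not is_training:
        return x, y
    pre_x = x
    # (N, C) -> (N, 1, C); with batch_first=False the attended axis is
    # dim 0 = N, so attention runs across the batch ("cross-batch").
    x = encoder(x.unsqueeze(1)).squeeze(1)
    # Keep both original and batch-attended features for the shared head.
    x = torch.cat([pre_x, x], dim=0)
    y = torch.cat([y, y], dim=0)
    return x, y

x = torch.randn(N, C)
y = torch.randint(0, 10, (N,))
x2, y2 = batchformer(x, y, encoder)
assert x2.shape == (2 * N, C) and y2.shape == (2 * N,)
```

At inference time the module is skipped entirely (is_training=False returns the inputs unchanged), so no cross-batch dependence remains at test time.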