The input features to a transformer generally have shape (batch, tokens, dim).
As stated in the paper, BatchFormer performs attention at the batch level, but the input to this attention layer has shape (batch, 1, dim), obtained with an unsqueeze/squeeze operation.
I am wondering whether its shape should instead be (1, batch, dim)?
Maybe I misunderstood something.
Looking forward to your reply!
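
For reference, here is a minimal sketch of the shape handling I am asking about, assuming BatchFormer wraps a standard PyTorch `nn.TransformerEncoderLayer` with its default `batch_first=False`; the layer name and sizes below are just placeholders, not the actual code from the repo:

```python
import torch
import torch.nn as nn

# Hypothetical sketch, assuming a plain TransformerEncoderLayer is used
# (batch_first defaults to False, i.e. it expects (seq, batch, dim)).
dim, batch = 512, 8
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4)

x = torch.randn(batch, dim)          # pooled features, one vector per sample

# Layout described in the question: (batch, 1, dim)
out = encoder(x.unsqueeze(1)).squeeze(1)      # -> (batch, dim)

# Alternative layout I am asking about: (1, batch, dim)
out_alt = encoder(x.unsqueeze(0)).squeeze(0)  # -> (batch, dim)

print(out.shape, out_alt.shape)
```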