soso010816 opened 4 months ago
Maybe just modifying it like this would be right?
```python
import math

import torch
from torch import nn, Tensor


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 2048):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        # pe = torch.zeros(max_len, 1, d_model)  # original, sequence-first buffer
        pe = torch.zeros(max_len, d_model)        # batch-first buffer: [max_len, d_model]
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Args:
            x: Tensor, shape [batch_size, seq_len, embedding_dim]
        """
        # Slice by sequence length and broadcast over the batch dimension.
        x = x + self.pe[:x.size(1)].unsqueeze(0)
        return self.dropout(x)
```
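For what it's worth, a quick smoke test of the modified class with a batch-first input (the sizes below are arbitrary, just for illustration) runs without any transpose:

```python
# Hypothetical quick check of the modified encoder with a batch-first input.
enc = PositionalEncoding(d_model=16, max_len=2048)
x = torch.randn(4, 10, 16)   # [batch_size, seq_len, embedding_dim]
out = enc(x)
print(out.shape)             # torch.Size([4, 10, 16])
```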
Hello, I have great respect for your work and your ideas. But when I tested and debugged your code, I discovered some potential issues, which I would like to report to the authors.
the "transformer.py" requires that the shape of input tokenized is [seq_len, batch_size], like this:
But when I checked your finetuning code, I found the actual size of the input is `torch.Size([batch_size, seq_len])`, which makes `self.pos_encoder(src)` go wrong, because the addition inside it combines two tensors whose dimensions do not line up: the `x` in `x = x + self.pe[:x.size(0)]` in your `pos_encoder` is `[batch_size, seq_len, embedding_dim]` instead of `[seq_len, batch_size, embedding_dim]`. I think this cannot produce the transformer output we expect, and it may even lead to unexpected results in downstream models. The same thing also happens in `test_transfomer.py` when the input is just one sample of shape `torch.Size([seq_len])`, although that case can be solved by `input = input.unsqueeze(1)`. I believe this is just a small oversight the great authors made when open-sourcing the code, and it should have no impact on your pre-training process.
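To make the shape issue concrete, here is a minimal sketch of both behaviors; the buffer shape `[max_len, 1, d_model]` and the slice `pe[:x.size(0)]` are taken from the commented-out original above, while the concrete sizes are made up for illustration:

```python
# Minimal sketch of the mismatch (batch_size/seq_len/d_model values are arbitrary).
import torch

batch_size, seq_len, d_model = 4, 10, 16
x = torch.randn(batch_size, seq_len, d_model)  # batch-first input, as in the finetuning code

# Original buffer/slice: pe is [max_len, 1, d_model] and is sliced by x.size(0).
pe_old = torch.zeros(2048, 1, d_model)
wrong = x + pe_old[:x.size(0)]       # [4, 1, 16] broadcasts against [4, 10, 16]
# With these shapes no error is raised, but every token in sequence b receives
# the encoding for position b (its batch index), not for its sequence position.

# Modified buffer/slice: pe is [max_len, d_model] and is sliced by x.size(1).
pe_new = torch.zeros(2048, d_model)
right = x + pe_new[:x.size(1)].unsqueeze(0)   # [1, 10, 16] broadcasts over the batch
print(wrong.shape, right.shape)               # both torch.Size([4, 10, 16])
```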
Sincerely hope to get your reply!!!