yikunpku / RNA-MSM

Nucleic Acids Research 2024: RNA-MSM is an unsupervised RNA language model based on multiple sequences that outputs both embeddings and attention maps to match different types of downstream tasks.
https://aigene.cloudbastion.cn/#/rna-msm
MIT License

No EOS token appended #10

Open ZhiyuanChen opened 3 months ago

ZhiyuanChen commented 3 months ago

Hi,

Thank you for this wonderful work!

When I was trying to reproduce your results, I faced some challenges when getting a minimum working example to run.

import __main__

import torch

from model import MSATransformer
from utils.tokenization import Vocab
from msm.data import Alphabet

# evil hack: register placeholder config objects in __main__ so torch.load can unpickle the checkpoint
__main__.Config = dict()
__main__.OptimizerConfig = dict()
__main__.MSATransformerModelConfig = dict()
__main__.DataConfig = dict()
__main__.TrainConfig = dict()
__main__.LoggingConfig = dict()

pretrained = "RNA_MSM_pretrained.ckpt"

alphabet = Alphabet.from_architecture("rna language")
vocab = Vocab.from_esm_alphabet(alphabet)
tokenizer = vocab.encode
model = MSATransformer(vocab, num_layers=10)
model.load_state_dict(torch.load(pretrained, map_location='cpu')['state_dict'])
model.eval()

sequence = "UAGCNUAUCAGACUGAUGUUGA"
inputs = torch.tensor(tokenizer(sequence))[None, None, :]

o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))

The length of the sequence is 22, so inputs should have 24 tokens (with <cls> and <eos>), but it only has 23 tokens.

The inputs tensor is:

tensor([[[0, 7, 4, 5, 6, 9, 7, 4, 7, 6, 4, 5, 4, 6, 7, 5, 4, 7, 5, 7, 7, 5, 4]]])

Since vocab is Vocab({'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, 'A': 4, 'G': 5, 'C': 6, 'U': 7, 'X': 8, 'N': 9, '-': 10, '<mask>': 11}), it appears the <eos> token is not appended by the vocab.
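
As a quick sanity check (purely illustrative, reusing sequence, tokenizer, and inputs from the snippet above), the off-by-one is easy to see:

# Illustrative check of the token count described above:
# <cls> + 22 nucleotides = 23 ids; a trailing <eos> would make it 24.
print(len(sequence))   # 22
print(inputs.shape)    # torch.Size([1, 1, 23]) -> no <eos> appended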

tBai1994 commented 2 months ago

Hi, I would like to ask whether your problem has been solved. I have run into a similar issue.

yikunpku commented 2 months ago

Hi, @ZhiyuanChen @tBai1994 the reason for this issue is that we did not append the eos token, which is consistent with MSA Transformer. If you would like to append this special token, you can do so by setting append_eos = True at the following link: https://github.com/yikunpku/RNA-MSM/blob/43d3d93e5402a018af2b35003825485f4ccc96f3/msm/data.py#L166-L172