Open · ZhiyuanChen opened 3 months ago
Hi,
Thank you for this wonderful work!
When I was trying to reproduce your results, I faced some challenges when getting a minimum working example to run.
```python
import __main__

import torch

from model import MSATransformer
from utils.tokenization import Vocab
from msm.data import Alphabet

# evil hack: stub out the training-script config classes referenced by the
# checkpoint so that torch.load can unpickle it
__main__.Config = dict()
__main__.OptimizerConfig = dict()
__main__.MSATransformerModelConfig = dict()
__main__.DataConfig = dict()
__main__.TrainConfig = dict()
__main__.LoggingConfig = dict()

pretrained = "RNA_MSM_pretrained.ckpt"

alphabet = Alphabet.from_architecture("rna language")
vocab = Vocab.from_esm_alphabet(alphabet)
tokenizer = vocab.encode

model = MSATransformer(vocab, num_layers=10)
model.load_state_dict(torch.load(pretrained, map_location='cpu')['state_dict'])
model.eval()

sequence = "UAGCNUAUCAGACUGAUGUUGA"
inputs = torch.tensor(tokenizer(sequence))[None, None, :]  # shape (1, 1, L)
o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))
```
The length of the sequence is 22, so `inputs` should have 24 tokens (with `<cls>` and `<eos>`), but it only has 23 tokens.
The `inputs` tensor is:

```
tensor([[[0, 7, 4, 5, 6, 9, 7, 4, 7, 6, 4, 5, 4, 6, 7, 5, 4, 7, 5, 7, 7, 5, 4]]])
```
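Decoding these ids by hand confirms the count. A quick sketch, with the mapping copied from the vocab printed below rather than taken from any reverse-lookup API on `Vocab`:

```python
# id -> token mapping, copied from the Vocab printed below
id_to_tok = {0: '<cls>', 1: '<pad>', 2: '<eos>', 3: '<unk>', 4: 'A', 5: 'G',
             6: 'C', 7: 'U', 8: 'X', 9: 'N', 10: '-', 11: '<mask>'}
print([id_to_tok[i] for i in inputs[0, 0].tolist()])
# ['<cls>', 'U', 'A', 'G', 'C', 'N', 'U', 'A', 'U', 'C', 'A', 'G',
#  'A', 'C', 'U', 'G', 'A', 'U', 'G', 'U', 'U', 'G', 'A']
# 23 tokens: <cls> plus the 22 residues, with no trailing <eos>
```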
Since the vocab is

```
Vocab({'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, 'A': 4, 'G': 5, 'C': 6, 'U': 7, 'X': 8, 'N': 9, '-': 10, '<mask>': 11})
```

it appears the `<eos>` token is not appended by the vocab.

tBai1994 commented:

Hi, I would like to ask if your problem has been solved? I also encountered similar problems.
Hi @ZhiyuanChen @tBai1994, the reason for this issue is that we did not append the `<eos>` token, which is consistent with MSA Transformer. If you would like to append this special token, you can do so by setting `append_eos = True` at the following location: https://github.com/yikunpku/RNA-MSM/blob/43d3d93e5402a018af2b35003825485f4ccc96f3/msm/data.py#L166-L172
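For anyone who prefers not to edit `msm/data.py`, a minimal workaround sketch is to append the token manually after encoding; it assumes only the vocab printed above, where `'<eos>'` has id 2:

```python
import torch

tokens = torch.tensor(tokenizer(sequence))        # <cls> + residues, no <eos>
eos_idx = 2                                       # '<eos>' id per the Vocab above
tokens = torch.cat([tokens, torch.tensor([eos_idx])])
inputs = tokens[None, None, :]                    # now 24 tokens for a 22-nt sequence
o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))
```

Note that, per the comment above, the released checkpoint was trained without `<eos>`, so appending it at inference time may not match the model's training distribution; it mainly makes sense together with `append_eos = True` at tokenization time throughout.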