yikunpku / RNA-MSM

Nucleic Acids Research 2024: RNA-MSM is an unsupervised RNA language model based on multiple sequences that outputs both embeddings and attention maps to match different types of downstream tasks.
https://aigene.cloudbastion.cn/#/rna-msm
MIT License

No EOS token appended #10

Open ZhiyuanChen opened 3 months ago

ZhiyuanChen commented 3 months ago

Hi,

Thank you for this wonderful work!

When I was trying to reproduce your results, I faced some challenges when getting a minimum working example to run.

import __main__

import torch

from model import MSATransformer
from utils.tokenization import Vocab
from msm.data import Alphabet

# evil hack: register placeholder config objects in __main__ so torch.load can unpickle the checkpoint
__main__.Config = dict()
__main__.OptimizerConfig = dict()
__main__.MSATransformerModelConfig = dict()
__main__.DataConfig = dict()
__main__.TrainConfig = dict()
__main__.LoggingConfig = dict()

pretrained = "RNA_MSM_pretrained.ckpt"

alphabet = Alphabet.from_architecture("rna language")
vocab = Vocab.from_esm_alphabet(alphabet)
tokenizer = vocab.encode
model = MSATransformer(vocab, num_layers=10)
model.load_state_dict(torch.load(pretrained, map_location='cpu')['state_dict'])
model.eval()

sequence = "UAGCNUAUCAGACUGAUGUUGA"
inputs = torch.tensor(tokenizer(sequence))[None, None, :]

o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))

The length of the sequence is 22, so inputs should have 24 tokens (with <cls> and <eos>), but it only has 23 tokens.

The inputs tensor is:

tensor([[[0, 7, 4, 5, 6, 9, 7, 4, 7, 6, 4, 5, 4, 6, 7, 5, 4, 7, 5, 7, 7, 5, 4]]])

Since vocab is Vocab({'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, 'A': 4, 'G': 5, 'C': 6, 'U': 7, 'X': 8, 'N': 9, '-': 10, '<mask>': 11}), it appears the <eos> token is not appended by the vocab.
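
As a quick sanity check (purely illustrative, reusing sequence, tokenizer, and inputs from the snippet above), the off-by-one is easy to see:

# Illustrative check of the token count described above:
# <cls> + 22 nucleotides = 23 ids; a trailing <eos> would make it 24.
print(len(sequence))   # 22
print(inputs.shape)    # torch.Size([1, 1, 23]) -> no <eos> appended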

tBai1994 commented 2 months ago

Hi, I would like to ask whether your problem has been solved. I have run into a similar issue.

yikunpku commented 2 months ago

Hi, @ZhiyuanChen @tBai1994 the reason for this issue is that we did not append the eos token, which is consistent with MSA Transformer. If you would like to append this special token, you can do so by setting append_eos = True at the following link: https://github.com/yikunpku/RNA-MSM/blob/43d3d93e5402a018af2b35003825485f4ccc96f3/msm/data.py#L166-L172