ymcui / Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm series models)
https://ieeexplore.ieee.org/document/9599397
Apache License 2.0

When loading a wwm model, the tokenizer splits everything into single characters. Is this correct? #200

Closed rmbone closed 2 years ago

rmbone commented 2 years ago

from transformers import AutoTokenizer

tokenizer_auto = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

tokens2 = tokenizer_auto("使用语言模型来预测下一个词的probability。")
print(tokens2)
print(tokenizer_auto.decode(tokens2["input_ids"]))

{'input_ids': [101, 886, 4500, 6427, 6241, 3563, 1798, 3341, 7564, 3844, 678, 671, 702, 6404, 4638, 8376, 8668, 13254, 511, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} [CLS] 使 用 语 言 模 型 来 预 测 下 一 个 词 的 probability 。 [SEP]

Why don't I see tokens like 模 ##型 anywhere?
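
A minimal sketch, assuming the same checkpoint as above, that prints the token strings themselves (rather than the decoded sentence), so any WordPiece ## prefixes are directly visible:

from transformers import AutoTokenizer

# Assumption: same model name as in the snippet above; only the tokenizer is loaded.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

text = "使用语言模型来预测下一个词的probability。"
encoding = tokenizer(text)

# Map the ids back to token strings. In this vocabulary each Chinese character
# is its own token, while a non-Chinese word such as "probability" (three ids
# in the output above) is split into WordPiece sub-tokens carrying the ## prefix.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))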

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.