ymcui / Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm model series)
https://ieeexplore.ieee.org/document/9599397
Apache License 2.0

wwm mask details #219

Closed zhengjiawei001 closed 2 years ago

zhengjiawei001 commented 2 years ago

Judging from the vocab.txt file shipped with the transformers model, the vocabulary is character-level, and the source code also operates on character-level tokens: it masks input_ids with a certain probability, and when it encounters a token prefixed with "##", it treats that character as forming a word with the preceding character and masks them together. However, the HIT (哈工大) README also mentions a word segmentation tool and segmented text. Does that mean the Chinese input is segmented into words, so that a single token can represent a whole word? A rough sketch of my understanding follows.
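To make the question concrete, here is a minimal sketch of the "##"-based grouping I am describing. This is hypothetical illustration code, not the repo's actual pretraining script; the function name `whole_word_mask`, the 15% rate, and the segmentation-to-"##" convention shown in the example are my assumptions about how the data preparation works.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole words: a token starting with '##' is grouped with
    the preceding token and the whole group is masked together.
    (Sketch only; rates and names are illustrative.)"""
    # Build whole-word candidates from the '##' continuation prefix.
    cand_indexes = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if cand_indexes and tok.startswith("##"):
            cand_indexes[-1].append(i)  # continuation of previous word
        else:
            cand_indexes.append([i])    # start of a new word

    random.shuffle(cand_indexes)
    num_to_mask = max(1, int(round(len(tokens) * mask_prob)))

    masked = list(tokens)
    covered = 0
    for word in cand_indexes:
        if covered >= num_to_mask:
            break
        for i in word:
            masked[i] = mask_token  # mask every character of the word
        covered += len(word)
    return masked

# My assumption: for Chinese wwm, the '##' markers come from a word
# segmenter (e.g. LTP), e.g. "使用语言模型" segmented into
# ["使用", "语言", "模型"] yields character tokens
# ["使", "##用", "语", "##言", "模", "##型"], so masking one index
# pulls in the rest of its word.
tokens = ["[CLS]", "使", "##用", "语", "##言", "模", "##型", "[SEP]"]
print(whole_word_mask(tokens))
```

If this is roughly right, the vocabulary stays character-level and the "##" prefix only exists during data creation to tell the masker which characters belong to the same segmented word.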

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.