Closed zhengjiawei001 closed 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.
Judging from the vocab.txt file in transformers, the vocabulary is character-level, and the source code also handles character-level tokens: with a certain probability the input_ids are marked with "#", and when a character carries a "##" prefix it is treated as forming a word together with the preceding character, so they are masked at the same time. However, the HIT (哈工大) repo also mentions a word segmentation tool and segmented text. Does that mean word segmentation is applied for Chinese, and can a single token represent a whole word?
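The masking behavior asked about above can be sketched as follows. This is a minimal, simplified illustration of the whole-word-masking idea (not the repo's actual code): the vocabulary stays character-level, the "##" prefix only marks that a character continues the word begun by the previous character, and when any character of a word is selected for masking, every character of that word is replaced together. Function names and the masking probability are illustrative assumptions.

```python
import random

def group_whole_words(tokens):
    """Group character-level tokens into whole-word spans.

    A token starting with '##' is assumed to continue the word of the
    preceding token, so it joins the previous group (this mirrors the
    '##' marker convention described above; it is a sketch, not the
    repo's implementation).
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # continues the previous word
        else:
            groups.append([i])    # starts a new word
    return groups

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words: if a word is chosen, replace ALL of its
    characters with [MASK], not just one of them."""
    rng = random.Random(seed)
    out = [tok.lstrip("#") for tok in tokens]  # vocab itself is char-level
    for group in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in group:
                out[i] = "[MASK]"
    return out

# "语言" and "模型" are each two characters forming one word after segmentation.
tokens = ["语", "##言", "模", "##型", "好"]
print(group_whole_words(tokens))  # [[0, 1], [2, 3], [4]]
```

So under this scheme a token is still one character, but the "##" markers produced with the help of a segmenter let the masking step operate at word granularity: one word is masked as a unit even though it spans several character tokens.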