
ymcui / Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm series models)

https://ieeexplore.ieee.org/document/9599397

Apache License 2.0

9.56k stars, 1.38k forks

tokenizer #191

Closed: hongjianyuan closed this issue 3 years ago

hongjianyuan commented 3 years ago

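Hello. You use BERT's WordPiece tokenization, but when we tried to reproduce it, our vocabulary contained only single Chinese characters, without the continuation prefix. For example, for the character 蔷, your tokenizer has both 蔷 and ##蔷, but after our reproduction we only have 蔷. We would like to know the specific details of how you used WordPiece, and which library or package you used. We are using https://github.com/huggingface/tokenizers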
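
To make the discrepancy described above concrete, here is a minimal sketch that checks which single Chinese characters in a vocabulary lack a matching ##-prefixed entry. The helper names (`is_cjk_char`, `continuation_coverage`) and both toy vocabularies are hypothetical illustrations, not taken from Chinese-BERT-wwm or from huggingface/tokenizers, and the character-range check is a simplification.

```python
CONTINUATION_PREFIX = "##"  # WordPiece marker for a non-initial subword


def is_cjk_char(token):
    """Rough check: a single character in the main CJK Unified Ideographs block.

    Assumption: only U+4E00..U+9FFF is covered here; real tokenizers
    usually treat several more CJK ranges as Chinese characters.
    """
    return len(token) == 1 and 0x4E00 <= ord(token) <= 0x9FFF


def continuation_coverage(vocab):
    """Map each single CJK character in vocab to whether '##' + char is also present."""
    tokens = set(vocab)
    return {
        tok: (CONTINUATION_PREFIX + tok) in tokens
        for tok in sorted(tokens)
        if is_cjk_char(tok)
    }


# Hypothetical toy vocabularies, for illustration only.
reference_vocab = ["[UNK]", "蔷", "##蔷", "薇", "##薇", "the", "##ing"]
reproduced_vocab = ["[UNK]", "蔷", "薇", "the", "##ing"]

# Characters that have a plain entry but no '##' entry in the reproduced vocab.
missing = [ch for ch, has_prefix in continuation_coverage(reproduced_vocab).items()
           if not has_prefix]
print(missing)
```

Running the same check on a real vocab.txt (one token per line) would show whether the missing ## entries affect all Chinese characters or only some of them, which may help narrow down where the two vocabularies diverge.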