xiangking / ark-nlp

A private NLP coding package that quickly implements SOTA solutions.
Apache License 2.0
311 stars 64 forks

Change how TransfomerTokenizer handles out-of-vocabulary words #58

Closed xiangking closed 2 years ago

xiangking commented 2 years ago

PR types

Fix

PR changes

Fix the Tokenizer

Description

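  1. Add a WordpieceTokenizer class.
  2. The WordpieceTokenizer in the transformers library replaces a word that is not in the vocabulary with a single unk_token for the whole word. This PR changes that behavior so that each character or letter of such a word is mapped to unk_token instead. Closes #57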
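
The behavior change described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `wordpiece_tokenize`, the `per_char_unk` flag, and the toy vocabulary are hypothetical and exist only for illustration. It assumes the usual greedy longest-match-first WordPiece scheme with a `##` continuation prefix, and it assumes that "per character" means emitting one unk_token for every character of a word that cannot be fully segmented.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]", per_char_unk=True):
    """Greedy longest-match-first WordPiece over whitespace-split words.

    per_char_unk=False: an unsegmentable word becomes a single unk_token.
    per_char_unk=True:  an unsegmentable word becomes one unk_token per
                        character (assumed reading of the PR description).
    """
    output = []
    for word in text.split():
        chars = list(word)
        start = 0
        pieces = []
        failed = False
        while start < len(chars):
            end = len(chars)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                piece = "".join(chars[start:end])
                if start > 0:
                    piece = "##" + piece  # continuation-piece prefix
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                failed = True
                break
            pieces.append(match)
            start = end
        if not failed:
            output.extend(pieces)
        elif per_char_unk:
            # One unk_token per character keeps the output length aligned
            # with the input characters.
            output.extend([unk_token] * len(chars))
        else:
            output.append(unk_token)
    return output


# Toy vocabulary, purely illustrative.
vocab = {"un", "##aff", "##able", "hello"}
whole_word = wordpiece_tokenize("hello xyz", vocab, per_char_unk=False)
per_char = wordpiece_tokenize("hello xyz", vocab, per_char_unk=True)
```

With the toy vocabulary, `whole_word` holds a single unk_token for `xyz`, while `per_char` holds three, one per letter. Keeping one token per character can help tasks such as sequence labeling, where token positions need to line up with the original text.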