xiangking / ark-nlp

A private NLP coding package that quickly implements SOTA solutions.
Apache License 2.0

SpanTokenizer produces incorrect token_mapping indices #57

Closed: zhw666888 closed this issue 2 years ago

zhw666888 commented 2 years ago

Environment info: ark-nlp 0.0.9, Python 3.9

Information: When used with a BERT model, SpanTokenizer produces an incorrect token_mapping. For example, tokenizing the following input (underscores denote spaces) gives:

input: Bose_SoundSport_Free_真无线蓝牙耳机
tokens: ['[UNK]', '[unused1]', '[UNK]', '[unused1]', '[UNK]', '[unused1]', '真', '无', '线', '蓝', '牙', '耳', '机']
token_mapping: [[0], [1], [2], [3], [4], [5], [21], [22], [23], [24], [25], [26], [27]]

Here each English word is tokenized to [UNK] (out of vocabulary) and each space to [unused1], so each [UNK] should map to the full character span of its word rather than a single index. The correct token_mapping should be:

[[0, 1, 2, 3], [4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], [15], [16, 17, 18, 19], [20], [21], [22], [23], [24], [25], [26], [27]]
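For reference, the expected mapping can be reconstructed independently of the library. Below is a minimal sketch (not ark-nlp's actual implementation; the helper name span_token_mapping is hypothetical) that splits the text on spaces while tracking character offsets, assuming each space becomes one [unused1] token and each CJK character becomes its own token, matching the tokens shown above.

```python
def is_cjk(ch):
    # CJK Unified Ideographs block; sufficient for this example.
    return '\u4e00' <= ch <= '\u9fff'

def span_token_mapping(text):
    """Map each token position to its character indices in `text`:
    a space maps to one index (the [unused1] token), a CJK character
    maps to one index, and a run of other characters (an English word)
    maps to its full character span (a single [UNK] token)."""
    mapping, i = [], 0
    while i < len(text):
        if text[i] == ' ' or is_cjk(text[i]):
            mapping.append([i])                # one-character token
            i += 1
        else:
            j = i
            while j < len(text) and text[j] != ' ' and not is_cjk(text[j]):
                j += 1
            mapping.append(list(range(i, j)))  # whole word -> one [UNK]
            i = j
    return mapping

print(span_token_mapping('Bose SoundSport Free 真无线蓝牙耳机'))
# -> [[0, 1, 2, 3], [4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], [15],
#     [16, 17, 18, 19], [20], [21], [22], [23], [24], [25], [26], [27]]
```

This reproduces the correct token_mapping listed above, which is what the reporter expects SpanTokenizer to return.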