xiangking / ark-nlp

A private NLP coding package that quickly implements SOTA solutions.

Fix: TokenTokenizer ignores whitespace during tokenization #37

Closed · xiangking closed this 2 years ago

xiangking commented 2 years ago

Environment info

Python 3.8.10
ark-nlp 0.0.7


Information

tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')

>>> ['森', '麥', '康', '小', '米', '3', 'm', '4', 'm', '5', '5', 'c', '5', 'x', '5', 's',
     '5', 's', 'p', 'l', 'u', 's', 'm', 'i', '6', '6', 'x', '电', '源', '开', '机', '音',
     '量', '按', '键', '排', '线', '侧', '键', '小', '米', '5', 'c', '开', '机', '音', '量',
     '排', '线']
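
Every space in the input has been silently dropped: the underlying transformers vocab strips whitespace and returns an empty token list for it, so iterating the text character by character loses those positions. A minimal check of this (illustrative only; the issue does not name the vocab, so bert-base-chinese is assumed here):

from transformers import BertTokenizer

# Assumed vocab for illustration; most BERT-style wordpiece vocabs behave similarly.
vocab = BertTokenizer.from_pretrained('bert-base-chinese')

print(vocab.tokenize(' '))   # [] -- a whitespace-only string produces no tokens, so it vanishes
print(vocab.tokenize('米'))  # ['米']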
xiangking commented 2 years ago

As a workaround, you can override the class as shown below; the bug will be fixed in the next release.

from ark_nlp.processor.tokenizer.transfomer import TransfomerTokenizer

class TokenTokenizer(TransfomerTokenizer):
    """
    Transformer text encoder that tokenizes character by character,
    then handles ID conversion, padding, etc.

    Args:
        vocab: a transformers vocab object, vocab path, or vocab name,
            used for tokenization and converting text to IDs
        max_seq_len (:obj:`int`): preset maximum text length
    """  # noqa: ignore flake8

    def tokenize(self, text, **kwargs):
        tokens = []
        for token_ in text:
            tokenized_token_ = self.vocab.tokenize(token_)
            if tokenized_token_ == []:
                # The underlying vocab returns an empty list for characters it
                # strips (e.g. whitespace); keep the raw character so token
                # positions stay aligned with the original text.
                tokens.extend([token_])
            else:
                tokens.extend(tokenized_token_)

        return tokens

    def sequence_to_ids(self, sequence, **kwargs):
        # Character-level sequences go through the single-sentence encoding path.
        return self.sentence_to_ids(sequence, **kwargs)
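
A usage sketch of the patched class (the vocab name and max_seq_len below are illustrative assumptions, not taken from the issue): spaces now survive as standalone tokens, keeping the token sequence aligned with the raw text.

from transformers import BertTokenizer

# Illustrative setup: the vocab and max_seq_len are assumptions, not from the report.
vocab = BertTokenizer.from_pretrained('bert-base-chinese')
tokenizer = TokenTokenizer(vocab, max_seq_len=50)

# The space between '小米' and '5' is now kept as its own token,
# roughly ['小', '米', ' ', '5'] instead of ['小', '米', '5'].
print(tokenizer.tokenize('小米 5'))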