yagays / pytorch_bert_japanese


Add is_tokenized param to be able to optionally skip the tokenizing process #1

Open Lyuji282 opened 5 years ago

Lyuji282 commented 5 years ago

A developer may want to separate the tokenizing process from getting embeddings, so I implemented an is_tokenized flag.
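
The sketch below shows roughly what such a flag could look like. All names (tokenize_with_juman, embed_tokens, get_sentence_embedding) are illustrative stand-ins, not this repository's actual API, and the tokenizer/model calls are replaced with placeholders:

```python
from typing import List, Union


def tokenize_with_juman(text: str) -> List[str]:
    # Placeholder for Juman++ tokenization; a whitespace split stands in here.
    return text.split()


def embed_tokens(tokens: List[str]) -> List[float]:
    # Placeholder for the BERT forward pass; returns a dummy vector.
    return [float(len(t)) for t in tokens]


def get_sentence_embedding(text: Union[str, List[str]],
                           is_tokenized: bool = False) -> List[float]:
    # Skip tokenization when the caller has already tokenized the text.
    tokens = text if is_tokenized else tokenize_with_juman(text)
    return embed_tokens(tokens)


# Either form yields the same tokens for the downstream model.
print(get_sentence_embedding("これ は テスト です"))
print(get_sentence_embedding(["これ", "は", "テスト", "です"], is_tokenized=True))
```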

yagays commented 5 years ago

Thank you for your PR. This code is optimized for the "BERT日本語Pretrainedモデル". It is trained with Juman++ and is not supposed to be used with other tokenizers. I also don't think the is_tokenized param is a good idea, because the text argument is originally a string, but it would need to be a list when is_tokenized is True.
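
To make the type concern concrete, here is a small hypothetical illustration (not the repository's code) of what can go wrong when one parameter accepts both str and List[str]:

```python
from typing import List, Union


def to_tokens(text: Union[str, List[str]], is_tokenized: bool) -> List[str]:
    if is_tokenized:
        return list(text)  # a plain str is silently split into single characters here
    return text.split()    # stand-in for Juman++ tokenization


print(to_tokens("吾輩は猫である", is_tokenized=True))
# ['吾', '輩', 'は', '猫', 'で', 'あ', 'る']  -- almost certainly not what the caller intended
```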

Lyuji282 commented 5 years ago

Thank you for the reply. I understand that the tokenizer is fixed for a given BERT pretrained model; I just want to separate the tokenization server from the model-serving server. Certainly, an argument that accepts two different types is not a good idea; however, a list-typed argument would be better, as in bert-as-service.
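
For comparison, bert-as-service exposes a list-based client interface: encode() takes a list of sentences. A minimal usage sketch, assuming a bert-serving-start server is already running and the bert-serving-client package is installed:

```python
from bert_serving.client import BertClient

bc = BertClient()
vectors = bc.encode(["吾輩は猫である", "名前はまだ無い"])
print(vectors.shape)  # (2, hidden_size)
```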