GeorgeS2019 closed this issue 2 years ago
Hi @GeorgeS2019 ,
Thanks for your comments.
I'm not sure I understand why you need a .NET tokenization library. Can you please specify the scenario in which you would like to use it?
For Seq2SeqSharp, you can use any tokenization library for data processing. Seq2SeqSharp only cares about tokens: it takes tokens as input and produces tokens as output. For example, in the release package, if you open a test batch file such as test_enu_chs.bat, you will see that it first calls "spm_encode.exe" to encode the given input sentences into BPE tokens, then calls the Seq2SeqConsole tool, and finally calls "spm_decode.exe" to decode the BPE tokens back into sentences. Both "spm_encode" and "spm_decode" come from Google's SentencePiece project.
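The pipeline above could be sketched roughly as follows in a batch file. This is only an illustration of the three-step flow, not the actual contents of test_enu_chs.bat: the model file names, input/output file names, and the Seq2SeqConsole parameter names shown here are assumptions and may differ from the real script.

```bat
@echo off
REM Hypothetical sketch of the encode -> translate -> decode pipeline.
REM File names and Seq2SeqConsole options are illustrative, not verbatim.

REM 1. Encode raw English input into BPE tokens with SentencePiece
spm_encode.exe --model=enu.model < input.enu.txt > input.enu.bpe

REM 2. Run the Seq2SeqSharp console tool on the token stream
REM    (option names here are assumed for illustration)
Seq2SeqConsole.exe -Task Test -ModelFilePath enu_chs.model -InputTestFile input.enu.bpe -OutputFile output.chs.bpe

REM 3. Decode the output BPE tokens back into Chinese sentences
spm_decode.exe --model=chs.model < output.chs.bpe > output.chs.txt
```

The key point is that Seq2SeqSharp itself never touches raw text; any tokenizer that can round-trip between sentences and tokens can be swapped in for steps 1 and 3.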
In addition, the release package currently includes vocabularies and models for 8 languages (Chinese, German, English, French, Italian, Japanese, Korean and Russian). They were all built with the SentencePiece library.
@zhongkaifu Microsoft's BlingFire provides tokenizers very similar to those provided by HuggingFace, but with claimed better performance.
e.g. the GPT2 tokenizer provided by BlingFire exactly matches the vocab size of HuggingFace's. The library also provides additional information on how to create a custom tokenizer based on a diverse (close to complete) set of templates matching those of HuggingFace.
Seq2SeqSharp is a valid alternative option for a .NET Transformer architecture solution.
It seems that a cross-platform .NET tokenizer library, especially one with better performance than the Python libraries, would make it less challenging for Seq2SeqSharp to explore real-world end-to-end examples of other Transformer architectures, such as GPT2, BERT, etc.
Raising this issue to encourage users here to share their feedback toward a concerted effort on such a .NET tokenization library.