zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and so on.

Request for more challenging Transformer architecture use cases through a better-performing .NET tokenizer library #33

Closed · GeorgeS2019 closed this 2 years ago

GeorgeS2019 commented 2 years ago

Seq2SeqSharp is a valid alternative option for a .NET Transformer architecture solution.

It seems that a cross-platform .NET tokenizer library, especially one with better performance than those provided through Python libraries, would make it less challenging for Seq2SeqSharp to explore other real-world end-to-end Transformer architecture examples such as GPT-2, BERT, etc.

I am raising this issue to encourage users here to share their feedback, as part of a concerted effort towards such a .NET tokenization library.

zhongkaifu commented 2 years ago

Hi @GeorgeS2019 ,

Thanks for your comments.

I am not sure I understand why you need a .NET tokenization library. Can you please specify in which scenario you would like to use it?

For Seq2SeqSharp, you can use any tokenization library for data processing. Seq2SeqSharp only cares about tokens: it takes tokens as input and outputs tokens. For example, in the release package, if you open a test batch file such as test_enu_chs.bat, you will find that it first calls "spm_encode.exe" to encode the given input sentences into BPE tokens, then calls the Seq2SeqConsole tool, and finally calls "spm_decode.exe" to decode the BPE tokens back into sentences. Both "spm_encode" and "spm_decode" are from Google's SentencePiece project.
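For illustration, here is a minimal C# sketch of that round trip using System.Diagnostics.Process. The executable names come from the release package, but the file names, flags, and config path are assumptions modeled on test_enu_chs.bat rather than its exact contents:

```csharp
using System.Diagnostics;

// Run one console tool and wait for it to finish.
static void Run(string exe, string args)
{
    using var p = Process.Start(new ProcessStartInfo(exe, args)
    {
        UseShellExecute = false
    })!;
    p.WaitForExit();
}

// 1. Encode raw source sentences into BPE tokens with SentencePiece.
Run("spm_encode.exe", "--model=enu.model --output=test.src.bpe test.src.txt");

// 2. Run the model on the tokenized input (config file name is hypothetical).
Run("Seq2SeqConsole.exe", "-ConfigFilePath test_enu_chs.json");

// 3. Decode the output BPE tokens back into plain sentences.
Run("spm_decode.exe", "--model=chs.model --output=test.out.txt test.out.bpe");
```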

In addition, the release package includes vocabularies and models for 8 languages (Chinese, German, English, French, Italian, Japanese, Korean and Russian) so far. They were all built with the SentencePiece library.

GeorgeS2019 commented 2 years ago

@zhongkaifu Microsoft's BlingFire provides tokenizers very similar to those offered by HuggingFace, but with claimed better performance.

For example, the GPT-2 tokenizer provided by BlingFire matches the vocabulary size of the HuggingFace one exactly. The library also provides information on how to create your own custom tokenizer, based on a diverse (close to complete) set of templates like those of HuggingFace.
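For anyone who wants to try this from .NET, here is a minimal sketch of id-level tokenization with BlingFire, assuming the BlingFireNuget package and its pre-built gpt2.bin model. The method names follow the C# wrapper in the BlingFire repo, but the exact signatures are worth verifying against the current package:

```csharp
using System;
using System.Text;
using BlingFire; // from the BlingFireNuget package

// Load the pre-built GPT-2 BPE model shipped with BlingFire
// (the model file path is an assumption for this sketch).
ulong model = BlingFireUtils.LoadModel("./gpt2.bin");

string text = "Seq2SeqSharp is a tensor based deep neural network framework.";
byte[] utf8 = Encoding.UTF8.GetBytes(text);

// Convert the UTF-8 bytes to token ids; 0 is used as the unknown-token id here.
var ids = new int[512];
int count = BlingFireUtils.TextToIds(model, utf8, utf8.Length, ids, ids.Length, 0);

Console.WriteLine(string.Join(" ", ids[..count]));

BlingFireUtils.FreeModel(model);
```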