zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and so on.

Is there a need to integrate Byte-pair encodings (BPE)? #13

Closed GeorgeS2019 closed 3 years ago

GeorgeS2019 commented 3 years ago

Byte-pair encodings (BPE) are now very commonly used in NLP.

Is there a plan to integrate BPE into Seq2SeqSharp in the future?

If so, will that be a C# wrapper (e.g. a SWIG wrapper) around something like FastBPE?

Would you consider a pure C# version of e.g. FastBPE [ link to pure python FastBPE ]?

This issue is more of a feature proposal. Looking forward to getting some feedback.

zhongkaifu commented 3 years ago

Thanks @GeorgeS2019 for your suggestion.

I know sub-word tokenization is really useful for text generation tasks; for example, an MT task can gain 2~3 BLEU points on average. Some NN frameworks do integrate sub-word tokenization, such as Marian, which uses a built-in SentencePiece for data processing.

However, since it's part of data processing and includes several key steps of its own, such as model training, encoding and decoding, I prefer to create a separate project for it rather than integrating it into the Seq2SeqSharp project.

So, in my opinion, my plan would be:

1) Create a project for BPE training/encoding/decoding called SubwordSharp. :)
2) Create a training pipeline that integrates the SubwordSharp BPE model training, BPE encoding, Seq2SeqSharp training, BPE decoding, and evaluation steps together.
3) Create a runtime pipeline that integrates BPE encoding, Seq2SeqSharp inference, and BPE decoding together.
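For reference, the encoding step in such a project boils down to applying a learned, ordered list of merge operations to a character sequence. Below is a minimal sketch in C# under stated assumptions: the class name and the tiny merge table are hypothetical illustrations, not part of Seq2SeqSharp or any planned SubwordSharp API, and a real model would learn its merges from corpus statistics and handle word boundaries and unknown symbols.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of a BPE encoder: split a word into single characters,
// then repeatedly apply the highest-priority (lowest-rank) learned merge
// among adjacent symbol pairs until no learned merge applies.
public class BpeEncoder
{
    // Maps a symbol pair to its merge rank (lower rank = merged earlier).
    private readonly Dictionary<(string, string), int> _ranks;

    public BpeEncoder(IEnumerable<(string, string)> orderedMerges)
    {
        _ranks = orderedMerges
            .Select((pair, rank) => (pair, rank))
            .ToDictionary(x => x.pair, x => x.rank);
    }

    public List<string> Encode(string word)
    {
        var symbols = word.Select(c => c.ToString()).ToList();
        while (symbols.Count > 1)
        {
            // Find the adjacent pair with the best (lowest) merge rank.
            int bestRank = int.MaxValue, bestPos = -1;
            for (int i = 0; i < symbols.Count - 1; i++)
            {
                if (_ranks.TryGetValue((symbols[i], symbols[i + 1]), out int r) && r < bestRank)
                {
                    bestRank = r;
                    bestPos = i;
                }
            }
            if (bestPos < 0) break; // no learned merge applies anymore
            symbols[bestPos] += symbols[bestPos + 1];
            symbols.RemoveAt(bestPos + 1);
        }
        return symbols;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Toy merge list (hypothetical): first merge "l"+"o", then "lo"+"w".
        var encoder = new BpeEncoder(new[] { ("l", "o"), ("lo", "w") });
        Console.WriteLine(string.Join(" ", encoder.Encode("lower"))); // low e r
    }
}
```

The training step would be the inverse of this loop: count adjacent pair frequencies over the corpus, repeatedly record and apply the most frequent pair as a new merge, and emit the ordered merge list that the encoder above consumes.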