Closed GeorgeS2019 closed 3 years ago
Thanks @GeorgeS2019 for your suggestion.
I know sub-word tokenization is really useful for text generation tasks; for example, MT tasks typically gain 2~3 BLEU points on average, and some NN frameworks have already integrated sub-word tokenization, such as Marian, which uses built-in SentencePiece for data processing.
However, since it's part of data processing and includes several key steps, such as model training, encoding, and decoding, I prefer to create a separate project for it rather than integrating it into the Seq2SeqSharp project.
So, in my opinion, my plan would be:
1) Create a project for BPE training/encoding/decoding, called SubwordSharp. :)
2) Create a training pipeline that ties together SubwordSharp BPE model training, BPE encoding, Seq2SeqSharp training, BPE decoding, and evaluation.
3) Create a runtime pipeline that ties together BPE encoding, Seq2SeqSharp inference, and BPE decoding.
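For reference, the BPE training/encoding steps in 1) can be sketched in a few lines. This is a minimal illustration of the standard BPE algorithm (learn merge rules by repeatedly merging the most frequent adjacent symbol pair, then greedily apply them at encoding time), not the actual SubwordSharp API, which doesn't exist yet; all names here are hypothetical.

```python
# Minimal BPE sketch (hypothetical helper names, not a SubwordSharp API):
# learn merge rules from a tiny word-frequency corpus, then encode a new word.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every whole-token occurrence of `pair` with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def train_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def encode(word, merges):
    """Greedily apply the learned merges, in order, to a new word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, 10)
print(encode("lowest", merges))
```

The decoding step in 2) and 3) is then just concatenating the sub-word symbols and splitting on the `</w>` markers.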
Byte-pair encoding (BPE) is now very commonly used in NLP.
Is there a plan to integrate BPE into Seq2SeqSharp in the future?
If so, will that be a C# wrapper (e.g. a SWIG wrapper) around something like fastBPE?
Or would you consider a pure C# version of, e.g., fastBPE [ link to pure python FastBPE ]?
This issue is more of a feature proposal. Looking forward to getting some feedback.