Closed GeorgeS2019 closed 3 years ago
Thanks @GeorgeS2019 for your suggestion.
I know sub-word tokenization is really useful for text generation tasks; for example, MT tasks typically gain 2~3 BLEU points on average, and some NN frameworks have already integrated sub-word tokenization, such as Marian, which uses built-in SentencePiece for data processing.
However, since it's part of data processing and includes several key steps, such as model training, encoding, and decoding, I prefer to create a separate project for it rather than integrating it into the Seq2SeqSharp project.
So, in my opinion, my plan would be:
1) Create a project for BPE training/encoding/decoding, called SubwordSharp. :)
2) Create a training pipeline that ties together SubwordSharp BPE model training, BPE encoding, Seq2SeqSharp training, BPE decoding, and evaluation.
3) Create a runtime pipeline that ties together BPE encoding, Seq2SeqSharp inference, and BPE decoding.
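For reference, the BPE training/encoding steps in 1) can be sketched in a few lines. This is a minimal illustration of the standard BPE algorithm (learn merge rules by repeatedly merging the most frequent adjacent symbol pair, then greedily apply them at encoding time), not the actual SubwordSharp API, which doesn't exist yet; all names here are hypothetical.

```python
# Minimal BPE sketch (hypothetical helper names, not a SubwordSharp API):
# learn merge rules from a tiny word-frequency corpus, then encode a new word.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every whole-token occurrence of `pair` with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def train_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def encode(word, merges):
    """Greedily apply the learned merges, in order, to a new word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, 10)
print(encode("lowest", merges))
```

The decoding step in 2) and 3) is then just concatenating the sub-word symbols and splitting on the `</w>` markers.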
Byte-pair encoding (BPE) is now very commonly used in NLP.
Is there a plan to integrate BPE into Seq2SeqSharp in the future?
If so, will that be a C# wrapper (e.g. a SWIG wrapper) around something like fastBPE?
Or would you consider a pure C# version of, e.g., fastBPE [ link to pure python FastBPE ]?
This issue is more of a feature proposal. Looking forward to getting some feedback.