zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and more.

What is an SNT file and how to create a new one #52

Closed. axel578 closed this issue 1 year ago.

axel578 commented 1 year ago

Hello!

I would like to know if it would be possible to create a new vocab SNT file. I looked at this file in Notepad, but I'm not sure it's for vocabulary.

In my case, I want to generate text that is a sort of XML file with a limited number of tokens. None of the available SNT files are suited to what I'm trying to do.

I would like to know if it's possible in any way to generate a new SNT file based on a custom vocab.

zhongkaifu commented 1 year ago

Hi @axel578,

In the demo and release package, SNT files are data sets for training and testing, not vocabulary files.
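
Each line in an SNT file is one training/test sequence, with tokens separated by spaces. For example (illustrative content only, not from the release package):

he opened the door and looked outside .
the rain had stopped an hour ago .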

For the vocab file, it can either be generated from an SNT file, or you can use external files as vocab files.

In a vocab file there is one token per line, and each line has two parts: [token] \t [weight]. [weight] can be any value you want; Seq2SeqSharp doesn't use these weights for now.
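
For example, a custom vocab file for your XML-like tokens could look like this (the tokens and weights below are made up; the two columns are separated by a tab):

<s>	1.0
</s>	1.0
<unk>	1.0
<tag_open>	1.0
<tag_close>	1.0

If it helps, here is a minimal C# sketch that writes a file in that format (the token list and file name are assumptions for illustration):

using System.IO;

// Hypothetical custom tokens; replace them with your own XML-like vocabulary.
string[] tokens = { "<s>", "</s>", "<unk>", "<tag_open>", "<tag_close>" };

using (var writer = new StreamWriter("custom.vocab"))
{
    foreach (string token in tokens)
    {
        // One token per line: [token] \t [weight]; the weight is currently ignored by Seq2SeqSharp.
        writer.WriteLine(token + "\t1.0");
    }
}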

Thanks,
Zhongkai Fu

axel578 commented 1 year ago

Thanks for the answer!

I'd like to know what the two src and target models (enuSpm.model) are in this fiction text generation command:

.\bin\Seq2SeqConsole\Seq2SeqConsole.exe -Task Test -ModelFilePath .\model\seq2seq_fiction.model -InputTestFile .\data\test\test_fiction.txt -OutputPromptFile .\data\test\test_fiction.txt -OutputFile out_fiction.txt -MaxTestSrcSentLength 256 -MaxTestTgtSentLength 512 -ProcessorType CPU -SrcSentencePieceModelPath .\spm\enuSpm.model -TgtSentencePieceModelPath .\spm\enuSpm.model -BeamSearchSize 1 -DeviceIds 0,1,2,3 -DecodingStrategy Sampling -DecodingRepeatPenalty 10

zhongkaifu commented 1 year ago

For this command line, "test_fiction.txt" is the input file. It's used both as the input for the encoder and as the prompt for the decoder. "out_fiction.txt" is the output file generated by the decoder.
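
For example (illustrative only), if a line in test_fiction.txt is "The ship drifted toward the", the decoder continues from that prompt when generating the corresponding line of out_fiction.txt.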

GeorgeS2019 commented 1 year ago

@zhongkaifu

Different Text Generation Strategies: ArgMax, Beam Search, Top-P Sampling

Just curious, how is this language generation implementation similar to or different from, e.g., GPT-x?

zhongkaifu commented 1 year ago

Hi @GeorgeS2019

You could use Seq2SeqSharp to train GPT-x style models, as long as you have a training data set for it. They are all Transformer based models, and the data set is masked text.

axel578 commented 1 year ago

Yes, but what is .\spm\enuSpm.model for? Is it for the vocabulary? In my scenario the vocabulary is a bunch of code, different from the usual vocabulary.

zhongkaifu commented 1 year ago

enuSpm.model is a SentencePiece model used to encode/decode subword level tokens. Seq2SeqSharp can directly call the SentencePiece APIs for subword level encoding and decoding. SentencePiece has its own subword level vocabulary, which is different from your vocabulary.

I don't think you need to care about it, because with the "-SrcSentencePieceModelPath" and "-TgtSentencePieceModelPath" parameters, Seq2SeqSharp can automatically encode words in your vocabulary into subwords in the model vocabulary, and decode subwords back into words. With these two parameters, if you don't have a subword level vocabulary, you can set "SrcVocab" and "TgtVocab" to empty and let Seq2SeqSharp generate the vocabulary from the training set. For inference, the model itself already includes the vocabulary.
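
As a rough illustration of the SentencePiece round trip (the exact segmentation depends on the trained enuSpm.model; the pieces below are made up):

word level input:   unbelievable results
subword encoding:   ▁un believ able ▁results
decoded output:     unbelievable results

The "▁" marker records where the original word boundaries were, so decoding back to words is lossless.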

GeorgeS2019 commented 1 year ago

@zhongkaifu

You could use Seq2SeqSharp to train GPT-x style models, as long as you have a training data set for it

Is importing trained weights from, e.g., a GPT-2 .onnx file into a Seq2SeqSharp model (to avoid training) still part of a long-term plan?

zhongkaifu commented 1 year ago

@zhongkaifu

You could use Seq2SeqSharp to train GPT-x style models, as long as you have a training data set for it

Is importing trained weights from, e.g., a GPT-2 .onnx file into a Seq2SeqSharp model (to avoid training) still part of a long-term plan?

Yes, it's still a long-term plan, but I don't have a specific timeline for it.

I actually already chatted with the ONNX Runtime team last year, and operator translation between Seq2SeqSharp and ONNX is pretty straightforward, but it's not an urgent task for Seq2SeqSharp right now, because Seq2SeqSharp already supports large model training and fine-tuning for my daily work, and my work is not based on GPT-x models.

Thanks,
Zhongkai Fu

zhongkaifu commented 1 year ago

I know there is a need for modularity and reusability, but it's not a high priority or urgent for me right now. That's why I say it's a long-term plan.