zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM, and so on), multi-GPU support, cross-platform operation (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and more.

Issues to get started with "Seq2SeqClassificationConsole" #67

Closed TodayAI closed 1 year ago

TodayAI commented 1 year ago

I am having trouble getting started with setting up a demo for the Seq2SeqClassificationConsole app. I assume that, as a newbie, I am not setting up the training data correctly. Can you please point me in the right direction for setting up a demo?

I set up two folders for training and validation.

In the Train folder I placed two training files: Train01.CLS.snt and Train01.SRC.snt.

In the Valid folder I placed two validation files: Validate01.CLS.snt and Validate01.SRC.snt.

These are my command-line settings: -Task Train -TrainCorpusPath .\Train -ValidCorpusPaths .\Valid -TgtLang CLS -SrcLang SRC -ProcessorType CPU -DecoderType Transformer -EncoderLayerDepth 6

I get an error in the method BuildVocabs at the line "Vocab tgtVocab = tgtVocabs[1];" because tgtVocabs has no element at index 1.

public (Vocab, Vocab, Vocab) BuildVocabs(int srcVocabSize = 45000, int tgtVocabSize = 45000, bool sharedVocab = false)
{
    [...]
    (var srcVocabs, var tgtVocabs) = CorpusBatch.GenerateVocabs(srcVocabSize, tgtVocabSize);

    Vocab srcVocab = srcVocabs[0];
    Vocab clsVocab = tgtVocabs[0];
    Vocab tgtVocab = tgtVocabs[1];  // Error position: tgtVocabs has no element at index 1

Content of Train01.CLS.snt (one record per line):
What should I do if I have a sore throat and a runny nose? [SEP] I feel sore in my throat after getting up in the morning, and I still have clear water in my nose. I measure my body temperature and I don’t have a fever. Have you caught a cold? What medicine should be taken.
How can I recuperate if my ankle is twisted? [SEP] I twisted my ankle when I went down the stairs, and now it is red and swollen. X-rays were taken and there were no fractures. May I ask how to recuperate to get better as soon as possible.
How to diagnose Alzheimer's Caregiving ? [SEP] Now that your family member or friend has received a diagnosis of Alzheimers disease, its important to learn as much as you can about the disease and how to care for someone who has it. You may also want to know the right way to share the news with family and friends.
What are the treatments for Alzheimer's Caregiving ? [SEP] Currently, no medication can cure Alzheimers disease, but four medicines are approved to treat the symptoms of the disease. - Aricept (donezepil)for all stages of Alzheimers - Exelon (rivastigmine)for mild to moderate Alzheimers - Razadyne (galantamine)--for mild to moderate Alzheimers - Namenda (memantine)for moderate to severe Alzheimers - Namzarec (memantine and donepezil)for moderate to severe Alzheimers Aricept (donezepil)for all stages of Alzheimers Exelon (rivastigmine)for mild to moderate Alzheimers Razadyne (galantamine)
How to diagnose Alzheimer's Caregiving ? [SEP] When you learn that someone has Alzheimers disease, you may wonder when and how to tell your family and friends. You may be worried about how others will react to or treat the person. Others often sense that something is wrong before they are told. Alzheimers disease is hard to keep secret. When the time seems right, be honest with family, friends, and others. Use this as a chance to educate them about Alzheimers disease. You can share information to help them understand what you and the person with Alzheimers are going through. You can also tell them what they can do to help.

Content of Train01.SRC.snt (one tag per line):
Otorhinolaryngology
Orthopedics
Alzheimer1
Alzheimer2
Alzheimer3

zhongkaifu commented 1 year ago

Hi @TodayAI ,

It seems your task is to predict tags for input sentences, so you should use the SeqClassificationConsole tool.

The Seq2SeqClassificationConsole tool is used not only to predict tags but also to generate text from the given input sentences. It's a multi-task tool.

Thanks Zhongkai Fu

TodayAI commented 1 year ago

I want to generate text from the input sentence. I do not know how to train "predict tags and generate text from given input sentences". I am trying to rebuild the training from your medical question-and-answer demo. Currently there is no information available on how to set up training files for the Seq2SeqClassificationConsole.

Maybe I am starting to understand from your comment. I need to use three files for training and not the two files I have now:

Vocab srcVocab = srcVocabs[0]; -> This is the title and text as tokens for the classification -> Validate01.SRC.snt

Vocab clsVocab = tgtVocabs[0]; -> These are the tags for the tokens -> Validate01.CLS.snt

Vocab tgtVocab = tgtVocabs[1]; -> This is the file for training text generation, like GPT language-model training. I assume each line must align with the other files -> TrainGPT01.[???????].snt. If this is the way to go, my problem is which file type to assign where I marked [??????].

zhongkaifu commented 1 year ago

My medical Q&A demo (https://huggingface.co/spaces/zhongkaifu/medical_qa_chs) is based on a seq2seq model, and there are no tags in it. If you want to output both tags and text, you may choose one of the following:

  1. If you use Seq2SeqClassificationConsole, your target data set should look like "Tags \t Output Text" (a minimal sketch follows this list). However, your current target data set (*.SRC.snt) only has tags but no output text. This is why you get the error "tgtVocabs with Index 1 does not exist."
  2. If you use Seq2SeqConsole, your target data set will look like "Tags Output Text".
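
For illustration, a minimal sketch of one aligned line pair for Seq2SeqClassificationConsole (the answer text here is just a placeholder, and <TAB> marks the tab character) could look like this:

Train01.SRC.snt:  How can I recuperate if my ankle is twisted?
Train01.CLS.snt:  Orthopedics<TAB>I twisted my ankle when I went down the stairs, and now it is red and swollen.

Each line in the .SRC.snt file must align with the line at the same position in the target file.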

A GPT model only has a decoder, so you need to concatenate your input, tag, and output text into a single line (record) for it. For a GPT-style model, you can use the GPTConsole tool for training and test.

TodayAI commented 1 year ago

Thank you, I understand now. Is the "Tags \t Output Text" output saved into Vocab tgtVocab = tgtVocabs[1];?

Should I update your documentation once I am done with this?

zhongkaifu commented 1 year ago

The tags vocabulary is saved into tgtVocabs[0], and the output-text vocabulary is saved into tgtVocabs[1].

Please go ahead with the updates.

TodayAI commented 1 year ago

Thank you! With your help I was able to train a demo dataset for the "Seq2SeqClassificationConsole".

zhongkaifu commented 1 year ago

Good to know it. :)

TodayAI commented 1 year ago

I have a problem generating text. The model only predicts the classification. Output.txt: Otorhinolaryngology <s> </s>

I tried different console settings: -DecoderType Transformer and -DecoderType GPTDecoder.

These are the command-line settings: Seq2SeqClassificationConsole.exe -Task Test -InputTestFile Input.txt -OutputFile Output.txt -OutputPromptFile OutputPrompt.txt -ModelFilePath Seq2Seq.Model -ProcessorType CPU -DecoderType Transformer -DecodingStrategy Sampling -PrimaryTaskId 0 -CompilerOptions --use_fast_math -DecoderLayerDepth 2

How can I set up the console app "Seq2SeqClassificationConsole" to generate text?

Below is my Input.txt, for which I want to generate an answer: I have a sore throat and a running nose. What should I do?

These are the tags in the "CLS" file (one record per line):
Otorhinolaryngology [/t] Otorhinolaryngology
Orthopedics [/t] Orthopedics
Alzheimer [/t] Alzheimer
Alzheimer [/t] Alzheimer
Alzheimer [/t] Alzheimer

These are the tokens for the predicted category: What should I do if I have a sore throat and a runny nose? [SEP] I feel sore in my throat after getting up in the morning, and I still have clear water in my nose. I measure my body temperature and I don’t have a fever. Have you caught a cold? What medicine should be taken. [/t] A runny nose occurs when mucus drips from the nostrils. Sore throat and runny nose often occur together, such as with a cold, COVID-19 infection, or allergies. A runny nose results from mucus trapping germs or other harmful particles and draining out of the nose, ridding the body of the infection or irritant. A sore throat develops from the inflamed throat tissue and mucus dripping down the back of the throat. Viral infections that cause a sore throat and runny nose.

zhongkaifu commented 1 year ago

How did you train your model, e.g. what data set did you use and how many iterations have you already run? Text generation needs a larger dataset and more iterations during training.

For your test command line, you don't have to specify -OutputPromptFile if you don't have one, and you also don't need to specify -DecoderType or -DecoderLayerDepth, because they are already stored in your trained model. These parameters are only used in training mode.
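
For example, a trimmed version of the test command, keeping only options that already appear in this thread and dropping the training-only ones, might look like:

Seq2SeqClassificationConsole.exe -Task Test -InputTestFile Input.txt -OutputFile Output.txt -ModelFilePath Seq2Seq.Model -ProcessorType CPU -DecodingStrategy Sampling -PrimaryTaskId 0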

I didn't understand the tags in your "CLS" file. Why are the strings on both sides of the tab the same?

For your tokens for the predicted category: is this the text you expect the model to output? Why does it include the question and a tab [/t] here?

Can you please explain your task and share your input data and output data with a few examples, plus the config file and logs for both training and test, so that I can understand it clearly?

TodayAI commented 1 year ago

I attached my demo project. I hope the info helps to create a working demo for the "Seq2SeqClassificationConsole" that predicts labels and generates text. Any working example will do. With this knowledge I will suggest an update to your documentation.

Answers for your questions:

  1. The CLS file needs two groups. Otherwise an error is thrown: (The group size in target sentence is '1', but it should be 2). The second group seems not to be used in your code and is not part of the vocabulary.
  2. Input.txt contains the prompt for which a category should be predicted and an answer text should be created.
  3. You wrote: "For your tokens for the predicted category, is it the text you expect the model to output?" Yes. In the file Train01.SRC.snt, the text after Tags \t ("Output Text") should be used for generating text.

Seq2SeqClassificationConsole.zip

Command line parameters: Seq2SeqClassificationConsole -Task Train -TrainCorpusPath \Train -ValidCorpusPaths \Valid -TgtLang CLS -SrcLang SRC -ProcessorType CPU -DecoderType Transformer -MaxSrcSentLength 1024 -MaxValidSrcSentLength 1024 -MaxTgtSentLength 512 -MaxValidTgtSentLength 512 -EncoderLayerDepth 6 -CompilerOptions -use_fast_math -MaxEpochNum 100

Seq2SeqClassificationConsole -Task Test -InputTestFile Input.txt -OutputFile Output.txt -OutputPromptFile OutputPrompt.txt -ModelFilePath Seq2Seq.Model -ProcessorType CPU -DecoderType GPTDecoder -DecodingStrategy Sampling -PrimaryTaskId 0 -CompilerOptions --use_fast_math -TrainCorpusPath \Train -ValidCorpusPaths \Valid -TgtLang CLS -SrcLang SRC

Seq2SeqClassificationConsole -Task DumpVocab -ModelFilePath Seq2Seq.Model -SrcVocab ScrVocab.txt -TgtVocab TgtVocab.txt -ClsVocab ClsVocab.txt -TrainCorpusPath \Train -ValidCorpusPaths \Valid -TgtLang CLS -SrcLang SRC -ProcessorType CPU -DecoderType Transformer

zhongkaifu commented 1 year ago

I just checked your dataset, and it has some format issues. The .SRC.snt file should contain questions only, and the .CLS.snt file (I suggest renaming it to .TGT.snt) should contain both the tag and the predicted sentences. So the correct dataset looks like this:

Train01.SRC.snt (one question per line):
What should I do if I have a sore throat and a runny nose?
How can I recuperate if my ankle is twisted?

Train01.CLS.snt (CLS is ambiguous, so I suggest renaming it to TGT; one record per line):
Otorhinolaryngology [\t] I feel sore in my throat after getting up in the morning, and I still have clear water in my nose. I measure my body temperature and I don’t have a fever. Have you caught a cold? What medicine should be taken. A runny nose occurs when mucus drips from the nostrils. Sore throat and runny nose often occur together, such as with a cold, COVID-19 infection, or allergies. A runny nose results from mucus trapping germs or other harmful particles and draining out of the nose, ridding the body of the infection or irritant. A sore throat develops from the inflamed throat tissue and mucus dripping down the back of the throat. Viral infections that cause a sore throat and runny nose.
Orthopedics [\t] I twisted my ankle when I went down the stairs, and now it is red and swollen. X-rays were taken and there were no fractures. May I ask how to recuperate to get better as soon as possible. Orthopaedic disorders are the second largest source of disability globally, with low back pain being the single leading cause of disability. Approximately 1.71 billion people have musculoskeletal conditions worldwide. Musculoskeletal conditions can affect people of all ages and most commonly impact those in adolescence and older age. These conditions affect the ability to lead active lives and have a negative effect on economic productivity. The burden of musculoskeletal disease is expected to rise as populations age. These conditions place a considerable strain on sufferers, their carers, and on healthcare systems in general. They also keep large numbers of people from being fully active members of society. Given the demographic challenge of an ageing population, Europe will be particularly impacted by these conditions in the coming years.

The model will build vocabularies from your dataset.

The model trained by Seq2SeqClassificationConsole is an encoder-decoder model, so when you run "test" you don't have to specify -DecoderType, because the tool gets the model type from the model file. If your question is already in -InputTestFile, you don't need to specify -OutputPromptFile. "-OutputPromptFile" is the beginning part of your answer that prompts the model to generate the rest of the answer.

If you want to train your model as a GPT-type model, you need to use the GPTConsole tool for both training and test rather than Seq2SeqClassificationConsole, and put each question, tag, and answer together on a single line.

TodayAI commented 1 year ago

I changed the demo as you suggested and it compiles without issues. Training of the model is currently running.

You wrote: "If you want to train your model as GPT type model, you need to use GPTConsole tool for both training and test rather than Seq2SeqClassificationConsole, and put all your questions, tags and answers in a single line." -> I tested your demo "Chinese fiction writer" on Huggngface. -> How did you get the model to predict sentences in correct grammer with the GPTDecoder? I saw that your demo corrects itself in a new reply and fixes grammer issues itself. How did you setup the training data and the GPTDecoder? -> Did you simply train on a large text corpus or did you use an insruction based text corpus. Did you do something special to prepare the dataset?

I ask because I do not fully understand how to prepare the data for the GPTDecoder. I only know that I need to prepare one sentence per line. How do you keep together text parts that belong to one logical section? While testing the GPTDecoder with my own model, I saw that it uses vocabulary that does not belong to the current topic.

zhongkaifu commented 1 year ago

For a GPT-type model, I simply trained it on a large text corpus using the GPTConsole tool. For your task, the input for training would look as follows (question + tag + answer) before it gets tokenized:
What should I do if I have a sore throat and a runny nose? Otorhinolaryngology I feel sore in my throat after getting up in the morning, and I still have clear water in my nose. I measure my body temperature and I don’t have a fever. Have you caught a cold? What medicine should be taken. A runny nose occurs when mucus drips from the nostrils. Sore throat and runny nose often occur together, such as with a cold, COVID-19 infection, or allergies. A runny nose results from mucus trapping germs or other harmful particles and draining out of the nose, ridding the body of the infection or irritant. A sore throat develops from the inflamed throat tissue and mucus dripping down the back of the throat. Viral infections that cause a sore throat and runny nose.

For test, you just send questions, such as "How can I recuperate if my ankle is twisted?", to the model and let the model predict the rest of the text.
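
As a rough sketch only (this assumes GPTConsole shares the common -Task, -InputTestFile, -OutputFile, -ModelFilePath, and -ProcessorType parameters used by the other console tools in this thread), a test run could look something like:

GPTConsole.exe -Task Test -InputTestFile Questions.txt -OutputFile Answers.txt -ModelFilePath gpt.model -ProcessorType GPU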

TodayAI commented 1 year ago

For the GPTConsole tool, do I separate (question + tag + answer) with tabs [/t]?

zhongkaifu commented 1 year ago

GPTConsole tool doesn't require tabs [/t].

TodayAI commented 1 year ago

With your suggestions I got the console apps up and running. I am testing on different text corpora. However, I needed to make some small changes to data preparation and tokenization. This weekend I will start to train on a large text corpus with my underdog hardware, analogous to GPT-2 with OpenWebText. For this I purchased a used Jetson Nano 2GB. Are you interested in merging my tokenizer changes back into your original sources so they work out of the box with OpenWebText?

zhongkaifu commented 1 year ago

Any contribution is always welcome. :) You could send out a pull request for it.

Are you going to use the Jetson Nano 2GB for training or just for inference?

TodayAI commented 1 year ago

Yes, let's work the classical way with pull requests. Then you can decide if you want to include the suggested code. I am new to the Jetson Nano. As a first step I will use it for training.

Can training data of the GPTDecoder use "position embeddings", "segment embeddings", and "tag embeddings"?

You wrote above for the GPTConsole: "For your task, the input for training would look as follows (question + tag + answer)."

Does it make sense to use embeddings in the training data? I ask because I have a large number of supervised text prompts. Example: How to book a flight to Los Angeles? [Multiple supervised answers... ]

I want to use it for prompt engineering and create "seed prompts" to help the GPTDecoder with the correct context to generate an answer.

zhongkaifu commented 1 year ago

These embeddings are pretty general and can be used for any type of model. Segment embeddings and tag embeddings are optional; you can decide whether to enable or disable them.

The training set could look like this (one record per line):
Question1 + Answer1
Question1 + Answer2
...
Question1 + AnswerN
Question2 + Answer1
...
QuestionX + AnswerY

Each question can have a different number of answers.

How large is the model you want to train? I'm afraid the Jetson Nano 2GB may not have enough performance and memory for a larger model.

TodayAI commented 1 year ago

Thank you for the tips on data preparation. I will follow your suggestions.

So the binary model with all hashes is kept in memory during training. Should I then try to train on a virtual Linux server with 12 cores and 64 GB RAM for OpenWebText?

zhongkaifu commented 1 year ago

It depends on how large your model is. It would be more efficient if you could use a GPU for training, even if it's just a desktop/gaming RTX-series card.

TodayAI commented 1 year ago

I will train a large language model (on OpenWebText) on an RTX card. Can I ask some setup questions:

  1. How should I handle punctuation in the training data? Should I add spaces? Example:

Yesterday , I went into the city . Gorge said : " Please help me ! "

zhongkaifu commented 1 year ago

You need to run a tokenizer, such as SentencePiece, against your data set first, and then use the tokenized data set for training and test.

The model size you choose (such as the number of layers) depends on many factors, such as the size of your data set, your computing capacity, and others.

GPT-2 has several models of different sizes; 12 layers is the GPT-2 small model. As for GPT-3, I don't think a single RTX GPU is able to train it.

I don't understand "I will use this (English) GPT model for fine tuning." Are you going to train a new GPT model from scratch, or just fine-tune a publicly published GPT model?

The embedding is built into the model, and you will use (or update) it.

TodayAI commented 1 year ago

I want to try to build a 12- or 36-layer GPT model with Seq2SeqSharp on OpenWebText. I will rent the required hardware for this. I know from nanoGPT that training a similar 12-layer GPT-2 model takes about 38 hours on one machine with 8 x A100 GPUs. I assume that your code is more complex, so it will need more training time.

I think Seq2SeqSharp offers no function to import an existing GPT language model from Hugging Face trained on English?

I will use SentencePiece to prepare the training data. I will add spaces in front of all punctuation.

zhongkaifu commented 1 year ago

Different hyperparameters, such as batch size and update frequency, can also affect training time a lot. You may need to take some time to tune them.

No, but you could use Microsoft.ML.OnnxRuntime to load an existing trained ONNX GPT model from Hugging Face or elsewhere, and export the tensors to Seq2SeqSharp.

TodayAI commented 1 year ago

I rented a Windows-based GPU server and found a good offer for this. I am considering writing a booklet on advanced language models using .NET/C#, based on Seq2SeqSharp and my own native C# code, such as hidden Markov models. But we should chat about this privately. Because I am part of a crowded community, I might attract semi-professional users who drive you crazy with feature requests. So we need to consider whether it is better if I fork a GitHub sub-project. Please give it some thought.

"you could use Microsoft.ML.OnnxRuntime to load existing trained ONNX GPT model from huggingface or somewhere, and export tensors to Seq2SeqSharp" -> that's a pretty cool idea! I have a ONNX bert model in pure ml.net up and running.

zhongkaifu commented 1 year ago

Yes, feel free to fork Seq2SeqSharp. The development model is pretty flexible, and you can choose whichever way you like. :)

My email is fuzhongkai@gmail.com. We can chat about your book by email.

TodayAI commented 1 year ago

Ok. I will contact you privately.

TodayAI commented 1 year ago

Do I understand the process of tokenizing the training data correctly: Seq2SeqSharp reads all available training files and tokenizes the text corpus? This seems impractical with the 7 million training files of OpenWebText. So I need to split the training data across multiple training runs, is that correct? Will Seq2SeqSharp update a previously trained model with the data of a new training run?

Would splitting the training data into multiple training runs, updating the language model with each run, also allow better memory management on a large training corpus?

Can I ask another question on fine-tuning a pre-trained GPT model on "questions and answers" datasets? Please see the following simple examples. How would you set up the training data for the GPTConsole?

1.) What is the capital of France? [SEP] The capital of France is Paris.

2.) Mary goes into the kitchen. Mary goes into the living room. Where is Mary now? [SEP] Mary is in the living room.

3.) Oracle Analytics Cloud offers two sizing options. What are the sizing options offered by Oracle Analytics Cloud? [SEP] Oracle Analytics Cloud offers two sizing options.

zhongkaifu commented 1 year ago

For training, Seq2SeqSharp accepts a tokenized data set as input, so you need to choose and run a tokenizer for your task. For example, if you use SentencePiece as the tokenizer, you need to:

  1. Train a SentencePiece model or download an existing SentencePiece model
  2. Tokenize your raw data set using the above SentencePiece model
  3. Send the tokenized data set as input to Seq2SeqSharp for training (see the sketch below)
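
A minimal sketch of steps 1 and 2 with the standalone SentencePiece command-line tools (file names, vocabulary size, and model type are placeholders, not values from this thread):

spm_train --input=raw_corpus.txt --model_prefix=spm --vocab_size=32000 --model_type=bpe
spm_encode --model=spm.model --output_format=piece < raw_corpus.txt > train.TGT.snt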

For test, Seq2SeqSharp has already integrated the SentencePiece APIs, so you can send your raw data set as input to Seq2SeqSharp. You need to set the parameters "SrcSentencePieceModelPath" and "TgtSentencePieceModelPath" to point to your SentencePiece models, and Seq2SeqSharp will use them.

Most GPU memory cost comes from the forward-backward pass of the model and the weight updates. Training data are not kept in GPU memory except for the current mini-batch, so splitting the training data won't help, unless main memory and GPU memory are shared, as on the Nvidia Jetson.

Your examples look good if the model is used for a "questions and answers" task. Again, you need to tokenize them before using them as training input.

TodayAI commented 1 year ago

The process to train a GPT-2 model is well known. For details see here.

I am trying to figure out how to set up GPTConsole to replicate a GPT-2 model on OpenWebText with Seq2SeqSharp. From your comments I have learned so far:

  1. Setting up the training data: All 7 million text files need to be trained in one training run. GPTConsole will not eat up 6 GB of GPU RAM, as training a default GPT-2 model does, because it feeds the GPU with chunks of data. The reason is probably that the complete vocabulary needs to be known during training. It is not known whether I can split the training into multiple runs if I feed the full vocabulary to Seq2SeqSharp?

  2. Training parameters: Can be used as in other GPT-2 models.

  3. Training time: No information is available so far on training OpenWebText on specific hardware.

  4. Data setup for fine-tuning a pre-trained GPT-2 model: The data setup for fine-tuning on a "question and answer" dataset is described above.

Whether it is possible to fine-tune a pre-trained GPT-2 model is not known? It is a little confusing for me that it seems impossible to split the training into multiple runs, one after another. Generally it should be possible to train data chunks in multiple training runs when the text-corpus chunks are of the same generalization quality as the other chunks.

zhongkaifu commented 1 year ago

You could split the tokenized training set into multiple runs with the full vocabulary, but why do you need to do that? It won't save GPU memory. With the proper hyperparameters, you can fine-tune a pre-trained GPT-2 model with Seq2SeqSharp. Since the webpage you mentioned introduces the GPT-2 small model (layers=12, dim=768), here is a config example for reference:

{
  "DecoderLayerDepth": 12,
  "DecoderStartLearningRateFactor": 1.0,
  "DecoderType": "GPTDecoder",
  "EnableCoverageModel": false,
  "IsDecoderTrainable": true,
  "IsSrcEmbeddingTrainable": true,
  "IsTgtEmbeddingTrainable": true,
  "MaxValidSrcSentLength": 2048,
  "MaxValidTgtSentLength": 2048,
  "MaxSrcSentLength": 2048,
  "MaxTgtSentLength": 2048,
  "SeqGenerationMetric": "BLEU",
  "SharedEmbeddings": true,
  "SrcEmbeddingDim": 768,
  "TgtEmbeddingDim": 768,
  "ExpertNum": 1,
  "ExpertsPerTokenFactor": 1,
  "PointerGenerator": false,
  "MaxTokenSizePerBatch": 5500,
  "AMP": false,
  "SaveGPUMemoryMode": true,
  "BeamSearchSize": 1,
  "Beta1": 0.9,
  "Beta2": 0.98,
  "LossType": "CrossEntropy",
  "CompilerOptions": "--use_fast_math --include-path=.",
  "ConfigFilePath": "",
  "DecodingStrategy": "GreedySearch",
  "DecodingTopPValue": 0.0,
  "DecodingRepeatPenalty": 2.0,
  "DecodingDistancePenalty": 5.0,
  "DeviceIds": "0",
  "TaskParallelism":1,
  "DropoutRatio": 0.0,
  "EnableSegmentEmbeddings": false,
  "MaxSegmentNum": 16,
  "EncoderLayerDepth": 6,
  "EncoderStartLearningRateFactor": 1.0,
  "EncoderType": "Transformer",
  "GradClip": 5.0,
  "HiddenSize": 768,
  "IsEncoderTrainable": true,
  "MaxEpochNum": 100,
  "MemoryUsageRatio": 0.999,
  "ModelFilePath": "gpt2_small.model",
  "MultiHeadNum": 12,
  "NotifyEmail": "",
  "Optimizer": "Adam",
  "ProcessorType": "GPU",
  "SrcLang": "xxx",
  "LearningRateStepDownFactor":0.8,
  "StartLearningRate": 0.001,
  "ShuffleType": "NoPadding",
  "Task": "Train",
  "TooLongSequence": "Truncation",
  "TgtLang": "tgt",
  "TrainCorpusPath": "./data/train",
  "UpdateFreq": 2,
  "ValMaxTokenSizePerBatch": 20480,
  "ValidCorpusPaths": null,
  "WarmUpSteps": 8000,
  "WeightsUpdateCount": 0,
  "StartValidAfterUpdates": 10000,
  "SrcVocabSize": 90000,
  "TgtVocabSize": 90000,
  "MinTokenFreqInVocab": 1,
  "SrcVocab": null,
  "TgtVocab": null,
  "ActivateFunc": "Relu",
  "EnableTagEmbeddings": false
}

Some parameters above need to be modified according to your task and environment. Here are some examples:

Modify the "MaxTokenSizePerBatch" and "UpdateFreq" values according to your GPU memory size.
Modify "--include-path" in "CompilerOptions" to point to the folder of the installed CUDA SDK on your machine.
Modify "TgtLang" to match the file names of your data set. The pattern of a data file name is %main_file_name%.%TgtLang%.snt.
Modify "TrainCorpusPath" to point to the folder of your training set.
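
Assuming GPTConsole accepts a -ConfigFilePath argument matching the "ConfigFilePath" key shown above (an assumption, not confirmed in this thread), training with this config could then be started roughly like:

GPTConsole.exe -ConfigFilePath gpt2_small.json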

TodayAI commented 1 year ago

Thank you very much. That's a good starting point.