zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). Its features include automatic differentiation, multiple network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), and multimodal models for text and images.

SeqClassification Validation #58

Closed: clm33 closed this issue 1 year ago

clm33 commented 1 year ago

Dear Zhongkaifu:

I am trying to validate a model trained on a sequence-classification task, but when trying to execute the program, the following error appears:

"Task 'Valid' is not supported"

In the "Usage: SeqClassificationConsole [parameters...]" section that is printed after the error, 'Valid' is listed as a valid value for the "Task" parameter, so I do not understand why it does not work.

I hope you can shed some light on my issue.

zhongkaifu commented 1 year ago

Hi @clm33,

If you check the file "Seq2SeqSharp\Tools\SeqClassificationConsole\Program.cs", you will find that I commented out the code for the "Valid" task. I forget why I commented it out, but you can try adding it back, rebuilding the project, and running the "Valid" task. Let me know if you run into any problems.

In addition, just a suggestion: choose metrics that fit your task, such as "SequenceLabelFscoreMetric" for F1 score, or implement new metrics for it.

Thanks Zhongkai Fu

clm33 commented 1 year ago

Thanks for answering. I will try to sort it out and let you know.

zhongkaifu commented 1 year ago

Okay, I just updated the code and added "Valid" back to the SeqClassification tool. However, since I don't have a trained SeqClassification model right now, I haven't tested it.

You can either 1) pull the latest code from the repo, or 2) copy and paste the code I modified into your project and run it. Let me know if you have any problems.

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Dear Zhongkai:

Is "opts.ShuffleBlockSize" the same as "opts.ValMaxTokenSizePerBatch" in your new "valid" code?

Thanks a lot

zhongkaifu commented 1 year ago

@piedralaves

In the latest code, Seq2SeqSharp no longer has "opts.ShuffleBlockSize", because I now shuffle the data set in a new way that has no memory limitation.

"opts.ValMaxTokenSizePerBatch" is the maximum number of tokens that can be sent to the network in one mini-batch during validation.

Thanks Zhongkai Fu
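
To illustrate what a per-batch token limit means, here is a minimal sketch of greedy batching under a token budget. It is not Seq2SeqSharp's implementation, which may count tokens differently (for example, including padding); `TokenBatcher` and the sample sentences are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tokenized sentences with 3, 2, 4 and 1 tokens.
var sentences = new List<string[]>
{
    new[] { "a", "b", "c" },
    new[] { "d", "e" },
    new[] { "f", "g", "h", "i" },
    new[] { "j" },
};

var batches = TokenBatcher.Split(sentences, maxTokensPerBatch: 5);
Console.WriteLine($"Number of mini-batches: {batches.Count}");

static class TokenBatcher
{
    // Greedily packs sentences into mini-batches so that the total token count
    // of a batch stays within maxTokensPerBatch. A single sentence longer than
    // the budget still gets a batch of its own.
    public static List<List<string[]>> Split(IEnumerable<string[]> sentences, int maxTokensPerBatch)
    {
        var batches = new List<List<string[]>>();
        var current = new List<string[]>();
        int tokensInCurrent = 0;

        foreach (var sentence in sentences)
        {
            // Start a new batch when adding this sentence would exceed the budget.
            if (current.Count > 0 && tokensInCurrent + sentence.Length > maxTokensPerBatch)
            {
                batches.Add(current);
                current = new List<string[]>();
                tokensInCurrent = 0;
            }
            current.Add(sentence);
            tokensInCurrent += sentence.Length;
        }

        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

With a budget of 5 tokens, the four sample sentences end up in two mini-batches of two sentences each. A larger budget means fewer, bigger batches, which is why the setting trades memory for speed.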

piedralaves commented 1 year ago

And, how many tokens do you recommend in opts.ValMaxTokenSizePerBatch?

We are trying to reconstruct the "valid" statement.

Thanks

piedralaves commented 1 year ago

Maybe 5120?

zhongkaifu commented 1 year ago

It depends on your GPU memory size. You could set it to 5120 and try it. If you get an OOM error, reduce it to a smaller value.

For "valid", this value only affects speed and has no impact on quality, so it's safe to experiment with different values.

Thanks Zhongkai Fu
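
The "try a value, and reduce it on OOM" advice can be automated. Below is a minimal sketch, under the assumption that a failed run surfaces as a .NET `OutOfMemoryException`; a GPU out-of-memory condition in Seq2SeqSharp may be reported as a different exception type. `BatchBudget.FindWorkingSize` and the simulated validation lambda are hypothetical and are not part of the library.

```csharp
using System;

// Hypothetical stand-in for a real validation run: it pretends that any
// budget above 2000 tokens per batch runs out of memory.
int chosen = BatchBudget.FindWorkingSize(5120, 256, size =>
{
    if (size > 2000) throw new OutOfMemoryException($"simulated OOM at {size}");
});
Console.WriteLine($"Validation succeeded with {chosen} tokens per batch");

static class BatchBudget
{
    // Calls validate(size) starting from `start` and halves the budget after
    // every OutOfMemoryException, until a run succeeds or the budget falls
    // below `min`.
    public static int FindWorkingSize(int start, int min, Action<int> validate)
    {
        for (int size = start; size >= min; size /= 2)
        {
            try
            {
                validate(size);
                return size;
            }
            catch (OutOfMemoryException)
            {
                Console.WriteLine($"OOM with {size} tokens per batch");
            }
        }
        throw new InvalidOperationException(
            $"Validation failed even at the minimum of {min} tokens per batch.");
    }
}
```

In this simulation, 5120 and 2560 fail and 1280 succeeds. Because the batch size does not change validation quality, halving is a safe search strategy.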

piedralaves commented 1 year ago

This is our code:

            else if (opts.Task == ModeEnums.Valid)
            {
                Logger.WriteLine($"Evaluate model '{opts.ModelFilePath}' by valid corpus '{opts.ValidCorpusPaths}'");

                // Create metrics
                ss = new SeqClassification(opts);
                Dictionary<int, List<IMetric>> taskId2metrics = new Dictionary<int, List<IMetric>>();
                for (int i = 0; i < ss.ClsVocabs.Count; i++)
                {
                    taskId2metrics.Add(i, new List<IMetric>());
                    taskId2metrics[i].Add(new MultiLabelsFscoreMetric("", ss.ClsVocabs[i].GetAllTokens(keepBuildInTokens: false)));
                }

                ss = new SeqClassification(opts);
                ss.EvaluationWatcher += Ss_EvaluationWatcher;

                // Load valid corpus
                //Seq2SeqCorpus validCorpus = new Seq2SeqCorpus(opts.ValidCorpusPaths, opts.SrcLang, opts.TgtLang, opts.ValBatchSize, opts.ShuffleBlockSize, opts.MaxTestSentLength, opts.MaxTestSentLength, shuffleEnums: opts.ShuffleType, tooLongSequence: opts.TooLongSequence);
                var validCorpus = new SeqClassificationMultiTasksCorpus(opts.ValidCorpusPaths, srcLangName: opts.SrcLang, tgtLangName: opts.TgtLang, opts.MaxTokenSizePerBatch, opts.MaxTestSentLength, shuffleEnums: opts.ShuffleType, tooLongSequence: opts.TooLongSequence);

                ss.Valid(validCorpus, taskId2metrics, null);
            }

When we change the value, sometimes an OOM error arises, and sometimes there is no error but also no F-score result:

info,05/02/2023 11:15:45 Metrics result on task '0' on data set 'valid': MultiLabelsFscore_ = The number of categories = '0'

Our logs:

SeqClassificationConsole_Valid_2023_02_05_11h_06m_01s.log

Is there something wrong with our code?

Thanks a lot

zhongkaifu commented 1 year ago

@piedralaves What does your data format look like? Can you please share a few examples of it?

Thanks Zhongkai Fu

clm33 commented 1 year ago

descriptions.cla.snt.txt descriptions.sam.snt.txt

The format is one utterance and one category per row. To feed them into the RNN they must be in separate files, so I have uploaded .txt files with some examples so you can see. ".txt" was appended to the file names so that they could be uploaded. When we feed them into the RNN, the names are descriptions.cla.snt for the categories and descriptions.sam.snt for the utterances.

zhongkaifu commented 1 year ago

Here is the latest code I checked in for SeqClassification validation. You can pull it into your project. Note that since SeqClassification supports multiple tasks, it uses "ValidCorpusPaths" in the config file rather than "ValidCorpusPath", and each task has a separate validation file. Can you please check whether your config file is correct?

            else if (opts.Task == ModeEnums.Valid)
            {
                 Logger.WriteLine($"Evaluate model '{opts.ModelFilePath}' by valid corpus '{opts.ValidCorpusPaths}'");

                // Create metrics
                ss = new SeqClassification(opts);
                Dictionary<int, List<IMetric>> taskId2metrics = new Dictionary<int, List<IMetric>>();

                for (int i = 0; i < ss.ClsVocabs.Count; i++)
                {
                    taskId2metrics.Add(i, new List<IMetric>());
                    taskId2metrics[i].Add(new MultiLabelsFscoreMetric("", ss.ClsVocabs[i].GetAllTokens(keepBuildInTokens: false)));
                }

                ss = new SeqClassification(opts);
                ss.EvaluationWatcher += Ss_EvaluationWatcher;

                // Load valid corpus
                if (!opts.ValidCorpusPaths.IsNullOrEmpty())
                {
                    string[] validCorpusPathList = opts.ValidCorpusPaths.Split(';');
                    foreach (var validCorpusPath in validCorpusPathList)
                    {
                        Logger.WriteLine($"Loading valid corpus '{validCorpusPath}'");
                        var validCorpus = new SeqClassificationMultiTasksCorpus(validCorpusPath, srcLangName: opts.SrcLang, tgtLangName: opts.TgtLang, opts.ValMaxTokenSizePerBatch, opts.MaxSentLength, shuffleEnums: opts.ShuffleType, tooLongSequence: opts.TooLongSequence);

                        Logger.WriteLine($"Validating corpus '{validCorpusPath}'");
                        ss.Valid(validCorpus, taskId2metrics, null);
                    }
                }          
            }
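
A note on the `Split(';')` call in the snippet above: called without options, it keeps empty entries, so a trailing semicolon in "ValidCorpusPaths" would yield an empty path. Here is a minimal sketch of a more tolerant parse, assuming .NET 5 or later (for `StringSplitOptions.TrimEntries`). The `PathList.Parse` helper and the sample paths are hypothetical and are not part of Seq2SeqSharp.

```csharp
using System;

// Hypothetical "ValidCorpusPaths" value: two entries, stray spaces, trailing ';'.
string raw = "data/valid_task0 ; data/valid_task1;";

foreach (string path in PathList.Parse(raw))
{
    Console.WriteLine($"Loading valid corpus '{path}'");
}

static class PathList
{
    // Splits a ';'-separated list, trims each entry, and drops empty ones.
    public static string[] Parse(string paths)
    {
        if (string.IsNullOrWhiteSpace(paths)) return Array.Empty<string>();
        return paths.Split(';', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    }
}
```

With the sample value, the loop sees exactly two paths, and neither has leading or trailing spaces.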

clm33 commented 1 year ago

Yes, we are using the "ValidCorpusPaths" parameter in the configuration file.

piedralaves commented 1 year ago

Hi Zhongkai:

We are revising our code. One possible problem is that when the SeqClassification constructor loads a pretrained model from a file, that is:

ss = new SeqClassification(opts);

the Items list is not loaded, and only IndexToWord and WordToIndex are populated.

This issue was affecting incremental training as well as validation.

We found a possible solution: loading the Items again when SeqClassification(opts) is called:

            if (File.Exists(m_options.ModelFilePath))
            {
                if (srcVocab != null || clsVocabs != null)
                {
                    throw new ArgumentException($"Model '{m_options.ModelFilePath}' exists and it includes vocabulary, so input vocabulary must be null.");
                }

                m_modelMetaData = LoadModelImpl_WITH_CONVERT(CreateTrainableParameters);
                //m_modelMetaData = LoadModelImpl();
                //---LoadModel_As_BinaryFormatter( CreateTrainableParameters );

                //loading the items again

                for (int n = 0; n < this.ClsVocabs.Count; n++)
                {
                    for (int k = 0; k < this.ClsVocabs[n].IndexToWord.Count; k++)
                    {
                        this.ClsVocabs[n].Items.Add(this.ClsVocabs[n].IndexToWord[k]);
                    }

                }

                //loading the items again
                for (int k = 0; k < this.SrcVocab.IndexToWord.Count; k++)
                {
                    this.SrcVocab.Items.Add(this.SrcVocab.IndexToWord[k]);
                }

            }

It is possible that this issue exists only in our code and not in yours. Sorry about that. Remember that we modified your code to load text2vec embeddings again for research reasons.

In any case, we appreciate your help very much and will continue our review.

Thanks a lot
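
One caveat with the reload loop in the comment above: if Items is already partly populated, or the code runs twice, `Items.Add` would create duplicates. Here is a minimal sketch of an idempotent variant. `ToyVocab` and `VocabRepair` are hypothetical stand-ins, not Seq2SeqSharp's real vocabulary class, and the member types are assumptions. Like the original loop, it assumes the indices run contiguously from 0 to Count - 1.

```csharp
using System;
using System.Collections.Generic;

var vocab = new ToyVocab();
vocab.IndexToWord[0] = "positive";
vocab.IndexToWord[1] = "negative";

VocabRepair.RebuildItems(vocab);
VocabRepair.RebuildItems(vocab); // a second call must not duplicate entries
Console.WriteLine(string.Join(", ", vocab.Items)); // prints "positive, negative"

// Hypothetical stand-in with only the two members mentioned in the thread.
class ToyVocab
{
    public Dictionary<int, string> IndexToWord { get; } = new Dictionary<int, string>();
    public List<string> Items { get; } = new List<string>();
}

static class VocabRepair
{
    // Repopulates Items from IndexToWord in index order. Clearing first makes
    // the operation safe to repeat.
    public static void RebuildItems(ToyVocab vocab)
    {
        vocab.Items.Clear();
        for (int k = 0; k < vocab.IndexToWord.Count; k++)
        {
            vocab.Items.Add(vocab.IndexToWord[k]);
        }
    }
}
```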

zhongkaifu commented 1 year ago

@piedralaves

It looks good to me. Let me know if you have any further questions.

In addition, are you using the latest code from the repo? If not, just out of curiosity, why not?

Thanks Zhongkai Fu

piedralaves commented 1 year ago

No, not the latest one. We are using a recent version in which we restored the functionality for loading text2vec embeddings. We are also researching classical embeddings, and we need that functionality.

https://github.com/zhongkaifu/Seq2SeqSharp/issues/50

For this reason, your advice is very valuable.