zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), a multimodal model for text and images, and more.

Sudden large increase in memory consumption during validation while training a seq2seq model #72

Closed: zsogitbe closed this issue 8 months ago

zsogitbe commented 8 months ago

Description

While training a seq2seq model, the GPU memory consumption jumps from 5 GB to 11 GB when validation starts. This suggests that the model and all memory content are copied in GPU memory for the validation. Because of this, validation crashes with the following error: Exception: 'ErrorOutOfMemory: The API call failed because it was unable to allocate enough memory to perform the requested operation.'

How to Reproduce

Train a simple seq2seq model and monitor memory consumption when validation starts.

Expected behavior

The GPU memory consumption should not increase while doing the validation. Example options to solve this problem: release GPU memory before the validation starts if possible, temporarily release the GPU memory used for training (save the model?), do the validation on the CPU, etc.

zhongkaifu commented 8 months ago

I updated the code to force a call to GC.Collect() after each mini-batch is processed. Please pull the latest code and try it out.

zsogitbe commented 8 months ago

Thank you for trying!

It would be nice if you could find a magic solution to this problem, but I am not sure if the internal logic of the input data will allow it. Very simply put, there should be an option to validate just 1 sentence with the model kept in memory and with a very small memory footprint.

zhongkaifu commented 8 months ago

Which options did you use to set the batch size for validation? Can you please try the following two options in your command line or config file?

"ValMaxTokenSizePerBatch" : 1 // How many tokens in a mini-batch during validation. You could set it to 1 "MaxValidTgtSentLength": 32 //How many tokens you want the seq2seq model to output.

Validation or testing consumes much less memory than training. How much GPU memory do you have? For lower GPU memory, you could also try the option "AMP": true to make Seq2SeqSharp use the Float16 type rather than Float32, which will save half of the GPU memory usage.
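For reference, a minimal sketch of a low-memory validation configuration combining these options (the values are illustrative, not recommendations):

```json
{
  "ValMaxTokenSizePerBatch": 1,  // smallest possible validation mini-batch
  "MaxValidTgtSentLength": 32,   // cap on the number of tokens generated per output during validation
  "AMP": true                    // run in Float16 instead of Float32 to roughly halve GPU memory usage
}
```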

zsogitbe commented 8 months ago

I will try. What exactly is ValMaxTokenSizePerBatch? Will this be 1 word or 1 sentence?

Can we convert a model trained in Float32 to a Float16 model?

zhongkaifu commented 8 months ago

"ValMaxTokenSizePerBatch": It's the estimated number of tokens in a mini-batch during validation. The larger value is, the more tokens (and sentences) will be included into a single mini-batch. You could check code here: https://github.com/zhongkaifu/Seq2SeqSharp/blob/69e644fad7aacbc7875c72fb4966361fb476d9e8/Seq2SeqSharp/Corpus/ParallelCorpus.cs#L353

Yes. By default, models are saved to disk in the Float32 type, but if you enable the "AMP" option, Seq2SeqSharp will load and use them in the Float16 type.

zsogitbe commented 8 months ago

That is clever! What about setting AMP before training? Will it then save the model in Float16 to decrease the model size?

zhongkaifu commented 8 months ago

Yes, you can set AMP for training; Seq2SeqSharp will then keep the model parameters and run the network using the Float16 type.

zsogitbe commented 8 months ago

It seems that the GPU memory is not released after a batch is processed and everything stays in GPU memory until the app is done.

zhongkaifu commented 8 months ago

That's because Seq2SeqSharp uses GPU memory pools for high-performance memory allocation and release. Seq2SeqSharp supports two types of GPU memory pool plus a basic memory allocation strategy, as listed below, and you can set the "CudaMemoryAllocatorType" option in the config file or on the command line to specify which memory allocator type you want to use.

"CudaMemoryPool": This is CUDA built-in memory pool implemented by Nvidia. It's able to dynamically adjust GPU memory usage.

"CustomMemoryPool": This is a GPU memory pool implemented by myself. You can set "MemoryUsageRatio" option to tell Seq2SeqSharp how much percentage GPU memory you want to be allocated for the pool.

"Basic": This is the option that do not use any memory pool. It will allocate GPU memory on demand, but its performance will be much slower than other two memory pool options.

You could try any of these three options and choose the one that works for you.

The memory pools are allocated at the beginning of the application and released when the application is done.
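For example, selecting an allocator in the config file might look like this (a sketch based on the option names above; the chosen value is just one of the three types described):

```json
{
  // One of "CudaMemoryPool", "CustomMemoryPool" or "Basic", as described above
  "CudaMemoryAllocatorType": "CudaMemoryPool"
}
```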

zsogitbe commented 8 months ago

I have run a test with seq2seq validation of 1000 sentences:

CudaMemoryPool: 1:57 min, 2.8 GB GPU memory
Basic: 2:23 min, 1.2 GB GPU memory
CustomMemoryPool: 2:01 min, 4.4 GB GPU memory

So, you were right about the reason. Basic is slower, but it uses GPU memory much more efficiently in terms of memory size (more reallocation, which makes it 18% slower). If we scale this up to a training task by a factor of 3000 (5 days of training, just an estimate), then we get approximately 18 hours slower if we use Basic. I would, however, look at the CustomMemoryPool implementation, because it uses much more memory than CudaMemoryPool and is not faster.

zhongkaifu commented 8 months ago

Please note that "Basic" mode will be much slower than the memory pool types, especially for training, because training runs both forward and backward steps and usually processes a much larger dataset than validation.

For CustomMemoryPool, you could set "MemoryUsageRatio" to control how much memory you want to use.
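For example, a config fragment for this setup might look like the following (the ratio value is illustrative only; whether the option expects a fraction or a percentage should be checked against the Seq2SeqSharp documentation):

```json
{
  "CudaMemoryAllocatorType": "CustomMemoryPool",
  // Portion of GPU memory reserved for the pool; illustrative value, the expected
  // format (fraction vs. percentage) should be verified in the documentation.
  "MemoryUsageRatio": 0.9
}
```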