zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform operation (Windows, Linux, x86, x64, ARM), a multimodal model for text and images, and more.

ManagedCuda.CudaException: ErrorOutOfMemory: The API call failed because it was unable to allocate enough memory to perform the requested operation #7

Closed · GeorgeS2019 closed this issue 4 years ago

GeorgeS2019 commented 4 years ago

Can you share your experience, perhaps in order of which parameters to adjust, that would reduce the chance of running out of GPU memory?

Below is one use case that leads to the problem:

Seq2SeqConsole.exe -TaskName Train -WordVectorSize 512 -HiddenSize 512 -StartLearningRate 0.002 -EncoderLayerDepth 6 -DecoderLayerDepth 2 -TrainCorpusPath "../../corpus" -ModelFilePath "../../trainings/seq2seq512.model" -SrcLang chs -TgtLang enu -ProcessorType GPU -DeviceIds 0 -MaxEpochNum 2 -EncoderType Transformer -BatchSize 32 -MultiHeadNum 8 -MaxSentLength 128

zhongkaifu commented 4 years ago

What's the size of your GPU memory? You can reduce the values of "-WordVectorSize" and "-HiddenSize" to train a smaller model, or reduce the batch size "-BatchSize".
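For instance, keeping the rest of the command above unchanged and only lowering those three options (the values here are just an illustration, not tuned recommendations):

Seq2SeqConsole.exe -TaskName Train -WordVectorSize 256 -HiddenSize 256 -StartLearningRate 0.002 -EncoderLayerDepth 6 -DecoderLayerDepth 2 -TrainCorpusPath "../../corpus" -ModelFilePath "../../trainings/seq2seq512.model" -SrcLang chs -TgtLang enu -ProcessorType GPU -DeviceIds 0 -MaxEpochNum 2 -EncoderType Transformer -BatchSize 8 -MultiHeadNum 8 -MaxSentLength 128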

In addition, what is the size of your source embedding and target embedding?

GeorgeS2019 commented 4 years ago

Hi, thanks for helping; I am still learning. I understand C# better than Python, so this is perhaps a good place to start learning. I found out about this repository through SciSharp. Great job by the people of this community.

My corpus has sentences of no more than 128 characters. Will this determine the WordVectorSize and HiddenSize? Is 128 the minimum value for "-WordVectorSize" and "-HiddenSize"?

GeorgeS2019 commented 4 years ago

Only 4GB. After this exercise, I realize how limiting 4GB is for learning NLP. I am reducing the "-BatchSize" to 8 after trial and error. Is there some guideline on how to determine these parameters without spending hours adjusting them, only to have the console program fail just before the epoch is over?

GeorgeS2019 commented 4 years ago

"What is the size of your source embedding and target embedding?" I did not set it, so I guess it is the default. What can I do about that?

zhongkaifu commented 4 years ago

The model size is determined by many different things, such as the number of layers, the hidden size of each layer, the vocabulary size on both the source and target sides, and so on. I suggest you start by trying smaller models first, such as setting -HiddenSize/-WordVectorSize to 64 or 128, setting -EncoderLayerDepth/-DecoderLayerDepth to 2, using a smaller vocabulary or training corpus, and so on.

For -MaxSentLength, what's the average length of the sentences in your training corpus? I also suggest you try a smaller value at first, such as 32.

After you have your first model, you can try different hyper-parameters for your next experiments.
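As a concrete sketch of such a starter run (reusing only the options that already appear in this thread; the paths and exact values are placeholders to adapt):

Seq2SeqConsole.exe -TaskName Train -WordVectorSize 128 -HiddenSize 128 -StartLearningRate 0.002 -EncoderLayerDepth 2 -DecoderLayerDepth 2 -TrainCorpusPath "../../corpus" -ModelFilePath "../../trainings/seq2seq128.model" -SrcLang chs -TgtLang enu -ProcessorType GPU -DeviceIds 0 -MaxEpochNum 2 -EncoderType Transformer -BatchSize 8 -MultiHeadNum 8 -MaxSentLength 32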

GeorgeS2019 commented 4 years ago

Hi Zhongkai Fu, I have kept trimming down the parameters; the training gradually runs longer, yet it still crashes eventually.

This Seq2SeqSharp of yours is the first of its kind to serve the community as an alternative to ML.NET.

It just needs more feedback and more contributions from the community to keep improving its robustness, even for small-GPU-RAM use cases like mine :-) For example, it would be great if it were possible to cap GPU RAM consumption, perhaps with support from ManagedCuda.

Again, I appreciate you and the people pushing SciSharp ... and obviously your generous support for the .NET community.

I will be switching to a machine with more GPU RAM, e.g. 8GB. It has been useful to learn the limits of small GPU RAM over the last few days. I will then get back to you :-).

zhongkaifu commented 4 years ago

Thanks @GeorgeS2019. If you want to continue debugging it, you can share your log file with me and I'd be glad to take a look. It is named like "Seq2SeqConsole_{time stamp}.log".

You can just share a few lines from the beginning of the log file, which include the key information about your training. It looks like:

D:\Seq2SeqSharp>Seq2SeqConsole.exe -ConfigFilePath train_opts.json
info,11/18/2019 2:41:59 PM Seq2SeqSharp v2.0 written by Zhongkai Fu(fuzhongkai@gmail.com)
info,11/18/2019 2:41:59 PM Command Line = '-ConfigFilePath train_opts.json'
info,11/18/2019 2:41:59 PM Loading config file from 'train_opts.json'
info,11/18/2019 2:41:59 PM Loading corpus from 'corpus_ek' for source side 'ENU' and target side 'CHS' MaxSentLength = '64', addBOSEOS = 'True'
info,11/18/2019 2:41:59 PM Loading corpus from 'corpus_valid' for source side 'ENU' and target side 'CHS' MaxSentLength = '64', addBOSEOS = 'True'
info,11/18/2019 2:41:59 PM Building vocabulary from given training corpus.
info,11/18/2019 2:41:59 PM Shuffling corpus...
info,11/18/2019 2:42:12 PM Shuffled '1172687' sentence pairs to file 'D:\Seq2SeqSharp\xe2i0esc.gbq' and 'D:\Seq2SeqSharp\ajgwcdmr.tc0'.
warn,11/18/2019 2:42:12 PM Found 7 sentences are longer than '64' tokens, ignore them.
info,11/18/2019 2:42:30 PM Source language Max term id = '205600'
info,11/18/2019 2:42:30 PM Target language Max term id = '109761'
info,11/18/2019 2:42:30 PM Creating decay learning rate. StartLearningRate = '0.002', WarmupSteps = '16000', WeightsUpdatesCount = '0'
info,11/18/2019 2:42:31 PM Creating Adam optimizer. GradClip = '5'
info,11/18/2019 2:42:31 PM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
Precompiling FillCopyKernels
Precompiling AdvFuncKernels
Precompiling SpatialMaxPoolKernels
Precompiling VarStdKernels
info,11/18/2019 2:42:33 PM Creating encoders and decoders...
info,11/18/2019 2:42:33 PM Creating transformer encoder at device '0'. HiddenDim = '128', InputDim = '128', Depth = '2', MultiHeadNum = '8'
info,11/18/2019 2:42:33 PM Create feed forward layer 'TransformerEncoder.SelfAttn_0.feedForwardLayer1' InputDim = '128', OutputDim = '512', DropoutRatio = '0.1', DeviceId = '0'
info,11/18/2019 2:42:33 PM Create feed forward layer 'TransformerEncoder.SelfAttn_0.feedForwardLayer2' InputDim = '512', OutputDim = '128', DropoutRatio = '0.1', DeviceId = '0'
info,11/18/2019 2:42:33 PM Create feed forward layer 'TransformerEncoder.SelfAttn_1.feedForwardLayer1' InputDim = '128', OutputDim = '512', DropoutRatio = '0.1', DeviceId = '0'
info,11/18/2019 2:42:33 PM Create feed forward layer 'TransformerEncoder.SelfAttn_1.feedForwardLayer2' InputDim = '512', OutputDim = '128', DropoutRatio = '0.1', DeviceId = '0'
info,11/18/2019 2:42:33 PM Creating attention unit 'AttnLSTMDecoder.AttnUnit' HiddenDim = '128', ContextDim = '128', DeviceId = '0'
info,11/18/2019 2:42:33 PM Create LSTM attention decoder cell 'AttnLSTMDecoder.LSTMAttn_0' HiddemDim = '128', InputDim = '128', ContextDim = '128', DeviceId = '0'
info,11/18/2019 2:42:33 PM Create LSTM attention decoder cell 'AttnLSTMDecoder.LSTMAttn_1' HiddemDim = '128', InputDim = '128', ContextDim = '128', DeviceId = '0'
info,11/18/2019 2:42:34 PM Create feed forward layer 'FeedForward' InputDim = '128', OutputDim = '109761', DropoutRatio = '0', DeviceId = '0'
info,11/18/2019 2:42:34 PM Start to train...
info,11/18/2019 2:42:34 PM Start to process training corpus.
info,11/18/2019 2:42:34 PM Shuffling corpus...
info,11/18/2019 2:42:44 PM Shuffled '1172687' sentence pairs to file 'D:\Seq2SeqSharp\dokiu0p0.14b' and 'D:\Seq2SeqSharp\324altii.vbu'.
warn,11/18/2019 2:42:44 PM Found 7 sentences are longer than '64' tokens, ignore them.
info,11/18/2019 2:42:45 PM Registering trainable parameters.
info,11/18/2019 2:42:45 PM Register network 'm_srcEmbedding'
info,11/18/2019 2:42:45 PM Register network 'm_tgtEmbedding'
info,11/18/2019 2:42:45 PM Register network 'm_encoder'
info,11/18/2019 2:42:45 PM Register network 'm_decoder'
info,11/18/2019 2:42:45 PM Register network 'm_decoderFFLayer'
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_Wxhc' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_b' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_layerNorm1.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_layerNorm1.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_layerNorm2.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_0.m_layerNorm2.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_Wxhc' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_b' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_layerNorm1.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_layerNorm1.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_layerNorm2.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.LSTMAttn_1.m_layerNorm2.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.AttnUnit.m_Ua' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.AttnUnit.m_Wa' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.AttnUnit.m_bUa' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.AttnUnit.m_bWa' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'AttnLSTMDecoder.AttnUnit.m_V' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'FeedForward.m_Whd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'FeedForward.m_Bd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.Q' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.Qb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.K' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.Kb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.V' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.Vb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.W0' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.b0' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.layerNorm1.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.layerNorm1.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.layerNorm2.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.layerNorm2.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.feedForwardLayer1.m_Whd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.feedForwardLayer1.m_Bd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.feedForwardLayer2.m_Whd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_0.feedForwardLayer2.m_Bd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.Q' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.Qb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.K' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.Kb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.V' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.Vb' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.W0' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.b0' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.layerNorm1.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.layerNorm1.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.layerNorm2.m_alpha' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.layerNorm2.m_beta' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.feedForwardLayer1.m_Whd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.feedForwardLayer1.m_Bd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.feedForwardLayer2.m_Whd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TransformerEncoder.SelfAttn_1.feedForwardLayer2.m_Bd' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'SrcEmbeddings' to optimizer.
info,11/18/2019 2:43:23 PM Added weight 'TgtEmbeddings' to optimizer.
info,11/18/2019 2:46:05 PM Update = 100, Epoch = 0, LR = 0.000013, Cost = 11.5811, AvgCost = 11.5988, Sent = 12800, SentPerMin = 3647.34, WordPerSec = 1690.87
info,11/18/2019 2:48:44 PM Update = 200, Epoch = 0, LR = 0.000025, Cost = 11.4263, AvgCost = 11.5592, Sent = 25600, SentPerMin = 4150.02, WordPerSec = 1923.06

GeorgeS2019 commented 4 years ago

Found 13 sentences are longer than '64' tokens, ignore them.
Source language Max term id = '85146'
Target language Max term id = '78871'
Creating decay learning rate. StartLearningRate = '2', WarmupSteps = '8000', WeightsUpdatesCount = '0'
Creating Adam optimizer. GradClip = '3'
Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
Precompiling FillCopyKernels
Precompiling AdvFuncKernels
Precompiling SpatialMaxPoolKernels
Precompiling VarStdKernels
Creating encoders and decoders...
Creating transformer encoder at device '0'. HiddenDim = '64', InputDim = '64', Depth = '2', MultiHeadNum = '8'
Create feed forward layer 'TransformerEncoder.SelfAttn_0.feedForwardLayer1' InputDim = '64', OutputDim = '256', DropoutRatio = '0,1', DeviceId = '0'
Create feed forward layer 'TransformerEncoder.SelfAttn_0.feedForwardLayer2' InputDim = '256', OutputDim = '64', DropoutRatio = '0,1', DeviceId = '0'
Create feed forward layer 'TransformerEncoder.SelfAttn_1.feedForwardLayer1' InputDim = '64', OutputDim = '256', DropoutRatio = '0,1', DeviceId = '0'
Create feed forward layer 'TransformerEncoder.SelfAttn_1.feedForwardLayer2' InputDim = '256', OutputDim = '64', DropoutRatio = '0,1', DeviceId = '0'
Creating attention unit 'AttnLSTMDecoder.AttnUnit' HiddenDim = '64', ContextDim = '64', DeviceId = '0'
Create LSTM attention decoder cell 'AttnLSTMDecoder.LSTMAttn_0' HiddemDim = '64', InputDim = '64', ContextDim = '64', DeviceId = '0'
Create LSTM attention decoder cell 'AttnLSTMDecoder.LSTMAttn_1' HiddemDim = '64', InputDim = '64', ContextDim = '64', DeviceId = '0'
Create feed forward layer 'FeedForward' InputDim = '64', OutputDim = '78871', DropoutRatio = '0', DeviceId = '0'
Start to train...
Start to process training corpus.
Shuffling corpus...
Found 13 sentences are longer than '64' tokens, ignore them.
Registering trainable parameters.
Register network 'm_srcEmbedding'
Register network 'm_tgtEmbedding'
Register network 'm_encoder'
Register network 'm_decoder'
Register network 'm_decoderFFLayer'

==> The console program gets stuck here and the memory keeps going up to more than 10GB. What is new is that I am now using an external GPU. This external GPU is using only 300MB of RAM; most of the processing seems to happen on the CPU side.

zhongkaifu commented 4 years ago

The first time you run Seq2SeqConsole, it takes longer because it compiles the functions for the GPU and then caches them. For your next run, it will be much quicker.

Your starting learning rate is pretty large; consider setting it to 0.001. What's your batch size? You can try a smaller batch size for your first training runs, such as 2, 4 or 8.
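For example (again only a sketch, built from the options already used in this thread, with placeholder paths and illustrative values), the starter command above with the smaller learning rate and batch size would look like:

Seq2SeqConsole.exe -TaskName Train -WordVectorSize 64 -HiddenSize 64 -StartLearningRate 0.001 -EncoderLayerDepth 2 -DecoderLayerDepth 2 -TrainCorpusPath "../../corpus" -ModelFilePath "../../trainings/seq2seq64.model" -SrcLang chs -TgtLang enu -ProcessorType GPU -DeviceIds 0 -MaxEpochNum 2 -EncoderType Transformer -BatchSize 2 -MultiHeadNum 8 -MaxSentLength 32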

GeorgeS2019 commented 4 years ago

I compared CPU versus the external GPU. With the CPU, 2 sentences and a batch size of 2, it finishes immediately, within 10 seconds. [The learning rate is 0.001, as you suggested.]

With the external GPU, it goes through phases of RAM increasing to 10GB, decreasing back to baseline, and then going up again to 10GB. This happened multiple times (more than 5 times) and then I stopped it.

So, I am still not sure where the problem is.

zhongkaifu commented 4 years ago

That's weird. What's your GPU model? If you don't mind, could you please share the command line and some of your training corpus with me, so I can test it on my machine?

GeorgeS2019 commented 4 years ago

Just sent an email to you :-)

GeorgeS2019 commented 4 years ago

With the latest code refactoring, it seems the issue is gone. Thanks and great job!