JunjieHu opened this issue 8 years ago (Open)
Even when I use the CPU to train the model, I get another error. Do you know how to fix it? Thanks!
```
$ th train_telling.lua -gpuid -1 -mc_evaluation -verbose -finetune_cnn_after -1
QADatasetLoader loading dataset file: visual7w-toolkit/datasets/visual7w-telling/dataset.json
image size is 28653
QADatasetLoader loading json file: data/qa_data.json
vocab size is 3007
QADatasetLoader loading h5 file: data/qa_data.h5
max question sequence length in data is 15
max answer sequence length in data is 5
assigned 5678 images to split val
assigned 14366 images to split train
assigned 8609 images to split test
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded cnn_models/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
converting first layer conv filters from BGR to RGB...
total number of parameters in RNN: 25609800
total number of parameters in CNN: 136358208
constructing clones inside the QA model
/home/jjhu/torch/install/bin/lua: ./modules/QAAttentionModel.lua:257: attempt to call global 'unpack' (a nil value)
stack traceback:
    ./modules/QAAttentionModel.lua:257: in function <./modules/QAAttentionModel.lua:200>
    (...tail calls...)
    train_telling.lua:188: in function 'lossFun'
    train_telling.lua:336: in main chunk
    [C]: in function 'dofile'
    ...jjhu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: in ?
```
This is an error caused by different Lua versions. Adding `unpack = unpack or table.unpack` at the top of the script, or changing `unpack` to `table.unpack`, will solve this problem.
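For reference, a minimal sketch of that shim (not copied from the repo): Lua 5.1 and LuaJIT expose `unpack` as a global, while Lua 5.2+ moved it to `table.unpack`, so the line below makes the call resolve under either version.

```lua
-- Compatibility shim for Lua 5.1 / LuaJIT vs. Lua 5.2+.
-- Place this near the top of the failing script (here modules/QAAttentionModel.lua).
unpack = unpack or table.unpack

-- Quick sanity check: behaves identically under both Lua versions.
print(unpack({10, 20, 30}))  -- prints: 10  20  30
```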
I used a K40 to train the model. If you get an out-of-memory error, try a smaller batch size or a shorter max sequence length. One possible cause of the error is that you are using LuaJIT; installing plain Lua (instead of LuaJIT) can solve the out-of-memory problem.
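A rough sketch of both workarounds; the `-batch_size` value below and the torch/distro rebuild commands are assumptions based on common Torch setups, not taken from this repo's documentation.

```bash
# Assumed example: reduce the batch size (the thread mentions a default of 64).
th train_telling.lua -gpuid 0 -mc_evaluation -verbose -finetune_cnn_after -1 -batch_size 16

# Assumed example: rebuild Torch (torch/distro) against plain Lua 5.2 instead of LuaJIT.
cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh
```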
Hi @yukezhu
Thanks for your reply. Changing `unpack` to `table.unpack` does solve my second problem.
On the other hand, changing the batch_size from 64 to 1 doesn't solve the out-of-memory problem. Actually, I installed Lua 5.2 instead of LuaJIT. Instead of running in GPU mode, I tried running the train_telling.lua script in CPU mode, and it runs smoothly! I observe that the program occupies ~5.8 GB of RAM. Since my GPU has 8 GB of memory, why would the OOM problem happen? Thanks again for your insightful suggestions.
@JunjieHu It's hard for me to diagnose the OOM problem without more information about your system configuration. I suspect the error might come from the particular CUDA, cuDNN, or Torch versions that you are using. As the CPU mode runs smoothly, I would recommend going with it for now. This code can train a good model on CPU within a day.
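If it helps with gathering that information, here is a small sketch (not part of the repo) that prints how much memory cutorch sees on the selected GPU; running it before the CNN is loaded shows how much headroom is actually available.

```lua
-- Print free/total memory on the current GPU as cutorch sees it.
-- Assumes cutorch is installed and a CUDA device is visible.
require 'cutorch'

local dev = cutorch.getDevice()
local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
print(string.format('GPU %d: %.2f GB free / %.2f GB total',
                    dev, freeBytes / 2^30, totalBytes / 2^30))
```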
Hi @yukezhu Thanks for your help! I solved all the configuration problems and everything works well. I notice that the repo only contains the telling questions; do you plan to release the code for the pointing questions?
@JunjieHu I'm having the same out-of-memory error with the GPU model. How did you solve the configuration problem? I have been trying different tricks, but to no avail. Thanks in advance.
Hi @yukezhu
I ran your program to retrain the model, but I get an out-of-memory error. I have a GTX 1080 (8 GB) installed in my machine. Which GPU did you use to train the model, and how much memory does it use?
```
$ th train_telling.lua -gpuid 0 -mc_evaluation -verbose -finetune_cnn_after -1
QADatasetLoader loading dataset file: visual7w-toolkit/datasets/visual7w-telling/dataset.json
image size is 28653
QADatasetLoader loading json file: data/qa_data.json
vocab size is 3007
QADatasetLoader loading h5 file: data/qa_data.h5
max question sequence length in data is 15
max answer sequence length in data is 5
assigned 5678 images to split val
assigned 8609 images to split test
assigned 14366 images to split train
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded cnn_models/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
converting first layer conv filters from BGR to RGB...
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-1922/cutorch/lib/THC/generic/THCStorage.cu line=65 error=2 : out of memory
/home/jjhu/torch/install/bin/lua: /home/jjhu/torch/install/share/lua/5.2/nn/utils.lua:11: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-1922/cutorch/lib/THC/generic/THCStorage.cu:65
stack traceback:
    [C]: in function 'resize'
    /home/jjhu/torch/install/share/lua/5.2/nn/utils.lua:11: in function 'torch_Storage_type'
    /home/jjhu/torch/install/share/lua/5.2/nn/utils.lua:57: in function 'recursiveType'
    /home/jjhu/torch/install/share/lua/5.2/nn/Module.lua:152: in function 'type'
    /home/jjhu/torch/install/share/lua/5.2/nn/utils.lua:45: in function 'recursiveType'
    /home/jjhu/torch/install/share/lua/5.2/nn/utils.lua:41: in function 'recursiveType'
    /home/jjhu/torch/install/share/lua/5.2/nn/Module.lua:152: in function </home/jjhu/torch/install/share/lua/5.2/nn/Module.lua:143>
    (...tail calls...)
    train_telling.lua:131: in main chunk
    [C]: in function 'dofile'
    ...jjhu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: in ?
```