yandex / faster-rnnlm

Faster Recurrent Neural Network Language Modeling Toolkit with Noise Contrastive Estimation and Hierarchical Softmax

NCE training (with GPU help) #1

Open ramin-git opened 9 years ago

ramin-git commented 9 years ago

Hi, everyone. I have some questions about faster-rnnlm. First, I want to use this toolkit with the option -direct 1000, but I get the error: "CUDA ERROR: Failed to allocate cuda memory for maxent out of memory". I know it is caused by -direct 1000, because with -direct 400 the error doesn't appear. Our GPU has 3GB of memory. Is there any way to use -direct 1000 without this error? Second, you use the GPU only when computing the validation or test entropy. Why don't you use the GPU during training? What is the reason? Don't you think training would be faster on the GPU?

akhti commented 9 years ago

Hi!

  1. -direct 1000 requires 1000 * 1000000 * 4 bytes (one float per hash cell) ≈ 4GB of memory on your GPU. As a result, the maxent weights fail to be copied to your GPU. A few workarounds are possible. You can use CPU-only mode with the '-use-cuda 0' option, or disable maxent. If the maxent layer is required and CPU validation is too slow, you can try to train the maxent and rnnlm models separately. That is, first train a model with '-direct 0' on the GPU and a model with '-hidden 0 -maxent 1000 -use-cuda 0' on the CPU. Then you ensemble the predictions of these two models.
  2. The main reason to prefer the CPU over the GPU is that softmax approximations (especially with a MaxEnt layer) work faster on the CPU due to many sparse memory reads/writes. In particular, my preliminary experiments with Hierarchical Softmax showed that training on the CPU with HogWild is much faster than training on the GPU. On the other hand, I believe that it is possible to train NCE models without MaxEnt efficiently on the GPU. But since I want to support HS and MaxEnt as well, that's not an option right now.
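A quick back-of-the-envelope check of the maxent memory sizing from point 1 (a sketch; it assumes -direct counts millions of 4-byte float hash cells, which is consistent with the ~4GB figure quoted above):

```python
def maxent_gpu_bytes(direct_millions):
    """Approximate maxent table size: `direct_millions` million hash
    cells, one 4-byte float weight per cell (an assumption inferred
    from the ~4GB figure in the answer above)."""
    return direct_millions * 1_000_000 * 4

# -direct 1000 -> ~3.7 GiB, which overflows a 3 GiB card;
# -direct 400  -> ~1.5 GiB, which fits.
for direct in (1000, 400):
    print(direct, maxent_gpu_bytes(direct) / 2**30)
```

This explains why -direct 400 works on the reporter's 3GB card while -direct 1000 does not.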
ramin-git commented 9 years ago

Our server has 32GB of RAM, an 8-core 3.4 GHz CPU, and 3GB of GPU memory. When I train the model with NCE and -use-cuda 1, without MaxEnt, computing the valid entropy takes 43-49 seconds per 500 sentences. The command is 'taskset -c 0,1,2,3,4,5,6,7 ../rnnlm -rnnlm model-test -train train-no-one-word-new-uniq-random-dict.txt -valid valid-new-uniq.txt -hidden 200 -hidden-type sigmoid -nce 20 -nce-accurate-test 1 -use-cuda 1 -threads 8 -alpha 0.01 -rmsprop 0.9 -bptt 4 -bptt-skip 10'. Without -use-cuda it also takes 43-49 seconds. But training without NCE, with or without MaxEnt, takes only 0.05-0.07 seconds. Is this normal?

akhti commented 9 years ago

That's weird. Does the rnnlm actually learn anything in less than a second? Does the valid entropy decrease?

ramin-git commented 9 years ago

Sorry, I did not explain the problem well. By "But training without NCE, with/without MaxEnt takes only 0.05-0.07 seconds. Is it normal?" I meant that computing the valid entropy for each 500 sentences takes only 0.05-0.07 seconds, not that the full training takes 0.05-0.07 seconds.

akhti commented 9 years ago

Yeap, that's normal. Validation with Hierarchical Softmax is a few orders of magnitude faster than with NCE.

The problem with NCE is that nobody guarantees that the predicted probabilities are stochastic, i.e. that the probabilities of all words sum to one. That's why validation for NCE is so extremely slow: we have to renormalize the probabilities. On the other hand, the predicted probabilities are quite close to the real ones. So, if you need probabilities for some kind of rescoring, you can disable nce-accurate-test at test time and compute approximated probabilities very fast.
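A minimal sketch of the trade-off described above (illustrative Python with toy scores, not the toolkit's C++ API): accurate test mode must sum over the whole vocabulary to renormalize, while the fast mode trusts that NCE training has driven the partition function close to 1.

```python
import math

def accurate_prob(scores, word):
    """Accurate mode (nce-accurate-test 1): renormalize explicitly.
    Costs O(|V|) per prediction because the partition function
    is summed over the full vocabulary."""
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[word]) / z

def fast_prob(scores, word):
    """Fast mode (nce-accurate-test 0): skip the sum and return the
    unnormalized estimate, which NCE training keeps roughly stochastic."""
    return math.exp(scores[word])

# Toy unnormalized log-scores for a 3-word vocabulary (illustrative only).
scores = {"a": -1.2, "b": -2.3, "c": -3.1}
print(accurate_prob(scores, "a"), fast_prob(scores, "a"))
```

Only the accurate probabilities are guaranteed to sum to one; the fast ones are merely close, which is usually fine for rescoring.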

VeliBaba commented 9 years ago

Hi! I read all your answers to the questions asked, and I want to say thanks. But I need a more detailed explanation, so I will ask a question that may have been asked before. The question is about training rnnlm using CUDA. I understand that CUDA is used only to compute the validation entropy during training in NCE mode. Why is CUDA not used during training itself, rather than only for the validation entropy? Is it impossible to use CUDA during training, at least for the matrix operations? Could you explain if you have time?

akhti commented 9 years ago

NCE validation uses only simple operations (like matrix multiplication) and can be implemented efficiently on the GPU.

As for training, some operations work faster on the GPU (matrix multiplication) and some work faster on the CPU (HS). The CPU-based solution allows the use of HogWild, which makes it faster than a GPU-based one for hidden layers of reasonable size. However, some combination of GPU and CPU may work faster. I haven't tried this yet.
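The HogWild idea mentioned above can be sketched as follows (illustrative Python, not the toolkit's C++ implementation): several threads apply sparse SGD-style updates to a shared weight vector with no locking at all, accepting that a rare colliding update may be lost.

```python
import random
import threading

# Shared weights, updated by all threads without any locks (HogWild-style).
weights = [0.0] * 1000

def worker(n_updates, seed):
    rng = random.Random(seed)
    for _ in range(n_updates):
        # Each sparse update touches a single coordinate, so
        # collisions between threads are rare.
        i = rng.randrange(len(weights))
        weights[i] += 0.01  # unsynchronized read-modify-write

threads = [threading.Thread(target=worker, args=(10_000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Total accumulated mass is at most 4 * 10_000 * 0.01 = 400;
# any difference comes from the (rare) lost updates HogWild tolerates.
print(sum(weights))
```

For sparse models like the maxent hash table this lock-free scheme scales nearly linearly with cores, which is why the CPU path can beat the GPU here.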