yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

How to switch on GPU mode on multiple nodes #109

Open gnosisyuw opened 8 years ago

gnosisyuw commented 8 years ago

I built a Spark cluster with several nodes to run CaffeOnSpark. Each node is equipped with a GPU. I compiled CaffeOnSpark in GPU mode and also changed the configuration in the solver. However, when I followed the GetStarted_local tutorial and deployed mnist training on multiple nodes (standalone mode), the computation was carried out by the CPU instead of the GPU. In contrast, training on a single node through CaffeOnSpark used the GPU successfully.

So my question is how to switch on GPU mode on multiple nodes.

anfeng commented 8 years ago

Please check your solver configuration file
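The setting in question is `solver_mode` in the solver prototxt. A minimal sketch of the relevant lines, with illustrative values borrowed from the bundled lenet example (other fields elided):

```
# lenet_memory_solver.prototxt (sketch -- illustrative values, other fields elided)
net: "lenet_memory_train_test.prototxt"
max_iter: 10000
solver_mode: GPU
```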

Andy


gnosisyuw commented 8 years ago

@anfeng I have already set `solver_mode: GPU` in the solver configuration, but it didn't work.
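Besides `solver_mode: GPU`, the number of GPUs per executor is passed on the CaffeOnSpark command line with `-devices`. A sketch of a standalone-mode launch, modeled on the GetStarted examples; the master URL, file paths, and model output location below are placeholders, not values from this thread:

```shell
# Sketch only: master URL, paths, and output location are placeholders.
spark-submit --master spark://master:7077 \
    --files lenet_memory_solver.prototxt,lenet_memory_train_test.prototxt \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -conf lenet_memory_solver.prototxt \
    -devices 1 \
    -connection ethernet \
    -model hdfs:///mnist.model
```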

guchensmile commented 8 years ago

I have another question: how can I improve the GPU utilization? When I submit a task (standalone mode), I can see this GPU usage:

    Timestamp                   : Thu Jul 14 12:49:37 2016
    Driver Version              : 352.39

    Attached GPUs               : 2
    GPU 0000:01:00.0
        Utilization
            Gpu                 : 9 %
            Memory              : 0 %
            Encoder             : 0 %
            Decoder             : 0 %

GPU utilization stays at 9% while the task runs. Is there any config parameter that can improve this?

BTW, I find that the larger I set spark.task.cpus, the longer the task takes.

junshi15 commented 8 years ago

It is possible that GPUs were not available to you, or that some of the cluster settings are incorrect. Say a node has only one GPU but two containers are scheduled on it, each requiring one GPU: one of the containers may not get a GPU if the GPU cannot be shared. Please look for "GPU not available" or similar wording in the log file.

We tested standalone mode on Amazon EC2 a while back. It worked fine.

Regarding GPU usage, there are two possibilities: 1) your network is too slow to feed the GPUs or to update the gradients. 2) your batch size is too small.

For (2), you can increase batch size to see if GPU utilization improves.
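For reference, batch size is set per data layer in the net prototxt. A sketch of the relevant fragment, assuming the MemoryData layer used by the mnist example; the source path and the suggested values are placeholders, not tested settings:

```
# Fragment of lenet_memory_train_test.prototxt (sketch; unrelated fields elided)
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  memory_data_param {
    source: "file:///path/to/mnist_train_lmdb"  # placeholder path
    batch_size: 64                              # try e.g. 128 or 256
  }
}
```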

gnosisyuw commented 8 years ago

@guchensmile, my case is similar to yours. Did you check the CPU utilization? Starting training did increase GPU utilization a little (5%), but CPU utilization increased tremendously (100% on all 4 cores of each node). So my assumption is that CaffeOnSpark needs the CPU to allocate the job, and the low GPU utilization results from a CPU bottleneck. @junshi15, is this possible?

guchensmile commented 8 years ago

@GnosisYu as you said, when I set spark.task.cpus to 2, CPU utilization is almost 200% on both nodes.

@junshi15 Thanks for your reply. I use the dataset from https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn, loaded with `hadoop fs -put -f ${CAFFE_ONSPARK}/data/mnist*_lmdb hdfs:/projects/machine_learning/image_dataset/`. The default value is batch_size: 64; I will try a larger value.

gnosisyuw commented 7 years ago

@guchensmile I am pretty sure the GPU was used during training: when I increased the batch size, GPU utilization increased proportionally.