yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

CaffeOnSpark slow in comparison with caffe #259

Open fouad2910 opened 7 years ago

fouad2910 commented 7 years ago

Hi, I am new to big data and Caffe, and I tried to run CaffeOnSpark in standalone mode and also on a cluster (4 nodes, each with 1 CPU, 4 cores, 16 GB RAM). On the cluster I always adapt the batch size to the cluster size, but there is no gain in time. Whatever the dataset (MNIST or CIFAR10), I don't see any speedup, and performance is worse than with Caffe. For example, CaffeOnSpark in standalone mode with MNIST (as in the wiki example) took 15 min, while Caffe with the MKL library took less than 5 min. The connection between the nodes should not be the problem, since its speed is 1 GB/s. Did I miss something? Can somebody help me, please? Thank you. Best regards, fouad2910

junshi15 commented 7 years ago

Make sure you are comparing the same total batch size.

https://github.com/yahoo/CaffeOnSpark/issues/244

fouad2910 commented 7 years ago

First, thank you for your answer. I adapted lenet_memory_solver.prototxt to match the Caffe solver for MNIST, i.e. I changed max_iter from 2000 to 10000. Before running CaffeOnSpark on 2 nodes, I also halved the batch size (64 to 32), so my total batch size with CaffeOnSpark is 2*32 = 64, the same 64 as with Caffe. CaffeOnSpark took 16 min to finish and Caffe 5 min. When I run `top` in a shell, I see the java process using 400% of my CPU, so I don't think this is a problem of core setup.
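
As a quick sanity check of that bookkeeping (an illustrative Python sketch, not CaffeOnSpark code, assuming the prototxt batch size is applied on every executor as described later in this thread):

```python
# Illustrative check only: effective batch per iteration when every executor
# runs the prototxt batch size.
per_worker_batch = 32   # lenet_memory_solver batch size after halving it
n_workers = 2           # CaffeOnSpark executors
total_batch_caffe_on_spark = per_worker_batch * n_workers   # 2 * 32 = 64
total_batch_caffe = 64                                      # single-process Caffe baseline
assert total_batch_caffe_on_spark == total_batch_caffe
```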

Below are two CaffeOnSpark log files. I have the impression that the communication works correctly, but the distributed Caffe takes more time. log user.txt container log.txt

junshi15 commented 7 years ago

I do not know your setup. We have seen slight improvement with LeNet on GPU. But CaffeOnSpark was not really designed to speed up a tiny network/dataset like LeNet/MNIST. We see more gain on Inception (or VGG)/ImageNet. Spark does create quite a bit of overhead, though.

fouad2910 commented 7 years ago

OK, I understand, but even with a bigger dataset like CIFAR10 I don't see any improvement. CaffeOnSpark with 2 workers: 48 min; Caffe: 20 min. I was expecting maybe a small increase in time for the CaffeOnSpark training because of the communication and the Spark job distribution, but here the link speed is 1 Gb/s and the training takes more than twice as long as on Caffe.

This is my setup:

Hadoop: Hadoop 2.6.4, Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 5082c73637530b0b7e115f9625ed7fac69f937e6, compiled by jenkins on 2016-02-12T09:45Z, compiled with protoc 2.5.0

Spark: Spark version 2.0.0, using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

Hardware: each of the 4 nodes has 1 CPU with 4 identical cores and 16 GB RAM. /proc/cpuinfo for one core:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping : 2
microcode : 0x36
cpu MHz : 2400.052
cache size : 30720 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bogomips : 4800.15
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

Do you have any idea what I can check? Or would you advise me to train a much bigger dataset or a more complex network to see whether the size is the problem? Thank you for your help!

junshi15 commented 7 years ago

Your hardware is fine. It is not obvious to me why CaffeOnSpark is so slow.

fouad2910 commented 7 years ago

Hi, I think I found something. Before, I was running 10,000 steps with a batch size of 32 and it took 16 min. Now I changed it to 100 steps with a batch size of 3200 and it took only 6 min. Do you have any idea why?

fouad2910 commented 7 years ago

My previous post is wrong: to make a fair comparison with Caffe, I also have to change the batch size in Caffe. After adapting the batch size, Caffe took 4 min 32 s to complete the training, so the problem still exists. I also observed that increasing the batch size processes more data per step and therefore increases the throughput, but it can lead to very poor accuracy.

fouad2910 commented 7 years ago

The previous results were obtained on 2 nodes; when I run the training on 4 nodes there is a speedup and it runs faster than Caffe!

junshi15 commented 7 years ago

Great. One useful experiment would be to run CaffeOnSpark on a single node, then compare it to Caffe.

fouad2910 commented 7 years ago

I applied your advice, and here are my final results for training on the MNIST dataset (28x28 pixel images) on a Hadoop YARN cluster. The setup is the same except for the number of training steps and the batch size: the batch size is always 6400 and the number of steps is 100 / number of workers, so the same total amount of data is processed (1 node: 100/1 = 100 steps; 2 nodes: 100/2 = 50 steps; 4 nodes: 100/4 = 25 steps).

Training on Caffe took 4 min 34 s. On CaffeOnSpark it took 15 min 4 s on 1 node, 6 min 36 s on 2 nodes, and 3 min 34 s on 4 nodes.

I see two remarkable things: for this small dataset the speedup only appears once 4 machines are used, and the accuracy decreases. For the latter, I think one can increase the number of steps to improve the accuracy by converging further toward the minimum of the loss function. Do you have any idea how Spark distributes Caffe across the nodes, and why the speedup only appears when training on 4 nodes? Thank you for your answer.
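
As a sanity check on the bookkeeping above (illustrative Python only, assuming the prototxt batch size applies to each executor as described in the next comment), every configuration sees the same number of images:

```python
# With a per-executor batch of 6400 and steps = 100 / number of workers,
# the total number of images processed is the same for every cluster size.
batch_size = 6400
for n_nodes in (1, 2, 4):
    steps = 100 // n_nodes
    total_images = batch_size * steps * n_nodes
    print(n_nodes, "node(s):", steps, "steps ->", total_images, "images")
# 1 node(s): 100 steps -> 640000 images
# 2 node(s): 50 steps -> 640000 images
# 4 node(s): 25 steps -> 640000 images
```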

junshi15 commented 7 years ago

Thanks for the results. CaffeOnSpark incurs quite a bit of overhead on a single node. I don't know the answer to your second question. As for the first question, Spark puts Caffe on each executor, and the executors train in a synchronous fashion: each executor gets a batch, runs forward then backward, and the gradients are averaged and distributed before the next batch is fetched.
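
For readers unfamiliar with that pattern, here is a toy NumPy sketch of synchronous data-parallel training (not CaffeOnSpark's actual implementation): every worker holds the same weights, computes a gradient on its own mini-batch, the gradients are averaged, and the shared weights are updated before the next step.

```python
# Toy illustration of synchronous data-parallel SGD with a linear
# least-squares model. Each "executor" computes a gradient on its own
# mini-batch; the gradients are averaged and one shared update is applied.
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch_per_worker, dim, lr = 4, 32, 10, 0.1

true_w = rng.normal(size=dim)          # "ground truth" the toy model should recover
w = np.zeros(dim)                      # shared weights, identical on every executor

def worker_gradient(w):
    """One executor: draw a local mini-batch and return its gradient."""
    X = rng.normal(size=(batch_per_worker, dim))
    y = X @ true_w + 0.01 * rng.normal(size=batch_per_worker)
    residual = X @ w - y
    return X.T @ residual / batch_per_worker   # gradient of 0.5 * mean squared error

for step in range(100):
    grads = [worker_gradient(w) for _ in range(n_workers)]   # runs in parallel on a real cluster
    w -= lr * np.mean(grads, axis=0)   # average the gradients, apply one shared update

print("distance to optimum:", np.linalg.norm(w - true_w))
```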

fouad2910 commented 7 years ago

At the beginning I was reducing my batch size so that the same amount of data was processed at a time, i.e. on a 4-node cluster each worker ran a quarter of the batch size (batch size 64, 10,000 training steps). Now I reduce the number of steps but process more data at a time, i.e. each node runs the full batch size and there are fewer training steps (batch size 6400, 100 steps). With both methods I process the same total amount of data, but the second one is faster. To sum up: in the first case each node receives a fraction of the batch size, and in the second case each node receives the total batch size but runs fewer training steps. I am now trying to run a bigger dataset (2.5 GB) with the two methods to compare them. Do you think it is still a fair comparison if I reduce the number of steps instead of reducing the batch size? Also, since the batch is given to each executor, do you think that memory access (copying the data to process onto each node) takes time and could be the main factor in the slowdown of the first method? Because there the amount of data per batch is small (64 images of 28x28 pixels), so memory has to be accessed many more times.
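
One way to quantify the difference between the two schedules (a rough, illustrative sketch, using the per-worker step counts from the earlier experiment and the one-gradient-average-per-step behaviour described above):

```python
# Illustrative only: both schedules push the same number of images through the
# cluster, but the first one needs far more synchronization rounds, since the
# executors average gradients once per training step.
def schedule(per_node_batch, steps, n_nodes):
    total_images = per_node_batch * n_nodes * steps
    sync_rounds = steps            # one gradient average per step
    return total_images, sync_rounds

# Method 1 on 2 nodes: total batch 64 split across the executors, 10000 steps
print(schedule(per_node_batch=32, steps=10000, n_nodes=2))   # (640000, 10000)
# Method 2 on 2 nodes: each executor runs the full 6400 batch, 100/2 = 50 steps
print(schedule(per_node_batch=6400, steps=50, n_nodes=2))    # (640000, 50)
```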

junshi15 commented 7 years ago

The ultimate comparison should be this: how much time does it take to achieve a certain accuracy, say 90%, for 1 node, 2 nodes, etc. This comparison is hard, since one has to adjust parameters such as the learning rate according to the total batch size.

A simple metric is to look at overall processing rate, i.e. how may images processed per second. For example, with 1 node, if you set the batch size to 128, and it takes 2 seconds per iteration, that is 64 images/second. For 2-node, say you set the batch size to 128, and takes 3 seconds per iteration, that is 128*2/3 = 85 images/second. This metric however has to be used carefully. If the training diverges, speed does not matter anymore, you get garbage in the end.