yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0
1.27k stars 358 forks source link

Test with multi workers on a single machine #268

Closed Yunzhi-b closed 6 years ago

Yunzhi-b commented 7 years ago

Hi! I'm using a single machine (4 cores, 10g memory) to test CaffeOnSpark.

I'm trying to use 2 workers to train my model on Standalone CPU mode, with such settings:

export MASTER_URL=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=2 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))  

${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 4G ${MASTER_URL}
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 4G ${MASTER_URL}

pushd ${CAFFE_ON_SPARK}/data/DAE

export IPYTHON_OPTS="notebook --no-browser --ip=`hostname`"

IPYTHON=1 pyspark  --master ${MASTER_URL} \
           --executor-memory 4G \
           --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
           --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
           --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
           --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
           --conf spark.executorEnv.DYLD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
           --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
           --conf spark.cores.max=${TOTAL_CORES} \
           --conf spark.task.cpus=${CORES_PER_WORKER} \
           --files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/DAE/solver_toy.prototxt,${CAFFE_ON_SPARK}/data/DAE/dae_train_test_toy.prototxt \
           --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"

But I got an error from console when I do caffeonspark.train(train_data):

Py4JJavaError: An error occurred while calling o2842.train.
: java.lang.IllegalStateException: actual number of executors is not as expected
    at com.yahoo.ml.caffe.CaffeOnSpark.setupTraining(CaffeOnSpark.scala:132)
    at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:171)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

Could you tell me how to correctly set nb of executors?

PS: with this setting, I could successfully do normal spark operations (create dataframe, manipulate rdd...) and both the two workers do work.

Thank you!

Yunzhi-b commented 7 years ago

And I noticed that if I use only one worker with 8 cores, it will take the same training time to using one worker with only 1 core. I'm always using single machine on standalone cpu mode. Do you know the reason for this case @junshi15 ?

Thanks again!

junshi15 commented 7 years ago

It is due to this line: https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeOnSpark.scala#L129-L133, which is mainly for spark-on-yarn.

Not sure what happens with standalone mode, you can try to set CORES_PER_WORKER=1. You can check whether 8 cores are used when you set COERS_PER_WORK=8, I doubt it.

Again, I only use yarn, not familiar with the standalone mode.

Yunzhi-b commented 7 years ago

Thanks for your reply.

I tried setting CORES_PER_WORKER=1 and SPARK_WORKER_INSTANCES=2, it didn't work.

Do you know how to check whether 8 cores are really used or not when COERS_PER_WORK=8?

In sparkUI, I found this:

Worker Id | Address | State | Cores | Memory
-- | -- | -- | -- | --
worker-20170707104633-10.0.2.15-37428 | 10.0.2.15:37428 | ALIVE | 8 (8 Used) | 8.0 GB
        (8.0 GB Used)

But I didn't know how to check whether these 8 cores are really used when training.

Thanks

junshi15 commented 7 years ago

I was thinking of launching simple command like "top" (if linux), while your job is running, check how much cpu utilization is. This is quite rough though.

From your description, it is likely spark grabs 8 cores. Whether caffe uses all of them is up to caffe implementation. CaffeOnSpark does not change caffe behavior.

Yunzhi-b commented 6 years ago

Thanks a lot! I changed the mode to YARN (pseudo distributed YARN in single virtual machine) for my model. And now I have a new problem about customized layer. I have used a customized layer (python layer in Caffe):

layer {
  name: "noisydata"
  type: "Python"
  bottom: "data"
  top: "noisydata"
  top: "mask"
  include {
    phase: TEST
  }
  python_param {
    module: "noisyLayer"
    layer: "NoisyLayer"
    param_str: '{"hide_ration": 0.0}'
  }
}

And I add its directory to PYTHONPATH. I have used it without problem in CaffeOnSpark Standalone mode.

But in YARN mode, I got an error saying it couldn't find this customized layer.

I0718 13:57:33.694187   993 layer_factory.hpp:77] Creating layer data
I0718 13:57:33.694200   993 net.cpp:99] Creating Layer data
I0718 13:57:33.694206   993 net.cpp:407] data -> data
I0718 13:57:33.694228   993 net.cpp:407] data -> label
I0718 13:57:33.694249   993 cos_data_layer.cpp:46] CoSDataLayer Top #0 20 3705 (74100)
I0718 13:57:33.694254   993 cos_data_layer.cpp:46] CoSDataLayer Top #1 20 3705 (74100)
I0718 13:57:33.694257   993 net.cpp:149] Setting up data
I0718 13:57:33.694265   993 net.cpp:156] Top shape: 20 3705 (74100)
I0718 13:57:33.694273   993 net.cpp:156] Top shape: 20 3705 (74100)
I0718 13:57:33.694277   993 net.cpp:164] Memory required for data: 592800
I0718 13:57:33.694281   993 layer_factory.hpp:77] Creating layer noisydata
ImportError: No module named noisyLayer
terminate called after throwing an instance of 'boost::python::error_already_set'

I have also set paths in .bashrc: (noisyLayer is in /home/user/Workspace/ModelsOnCaffe/CustomizedLayers )

export PYTHONPATH="/home/user/Workspace/ModelsOnCaffe/CustomizedLayers":$PYTHONPATH
export PYSPARK_PYTHON=${IPYTHON_ROOT}/bin/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=${IPYTHON_ROOT}/bin/python:$PYTHONPATH

And also in spark-defaults.conf

spark.yarn.appMasterEnv.PYSPARK_PYTHON            /home/user/anaconda2/bin/python:/home/yunzhi/Workspace/ModelsOnCaffe/CustomizedLayers
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON     /home/user/anaconda2/bin/python:/home/yunzhi/Workspace/ModelsOnCaffe/CustomizedLayers

Could you help me how to achieve the use of customized layer, and how CaffeOnSpark works in this case(particularly in a real cluster YARN mode)?

junshi15 commented 6 years ago

I have no experience with customized python layer, but the error message is quite clear. Python cannot import your noise layer since it does not know where to find it. I suppose you need ship the .py file to executors and place it in the right directory.

Yunzhi-b commented 6 years ago

Thanks for your prompt reply! As I'm running the model in pseudo yarn cluster (on a single machine), I think I don't need to ship .py file, am I right? It worked in standalone mode, means that CaffeOnSpark did find it with the current path setting in standalone mode, but in yarn mode, it couldn't find it. I really dont know how to resolve this problem. Do you have some ideas? Thanks a lot, Junshi

junshi15 commented 6 years ago

Sorry, I do not know how to solve your problem.