Yunzhi-b closed this issue 6 years ago
I also noticed that if I use one worker with 8 cores, training takes the same time as using one worker with only 1 core. I'm always running on a single machine in standalone CPU mode. Do you know the reason for this, @junshi15?
Thanks again!
It is due to these lines: https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeOnSpark.scala#L129-L133, which are mainly for spark-on-yarn.
Not sure what happens in standalone mode; you can try setting CORES_PER_WORKER=1. You can also check whether all 8 cores are really used when you set CORES_PER_WORKER=8; I doubt they are.
Again, I only use YARN and am not familiar with standalone mode.
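For what it's worth, a quick way to see what Spark itself thinks it was given (standard PySpark APIs, nothing CaffeOnSpark-specific; whether Caffe then uses those cores is a separate question):

```python
# Print the configuration the Spark application actually received;
# look for core/executor settings such as spark.executor.cores.
from pyspark import SparkContext

sc = SparkContext(appName="inspect-conf")
for key, value in sorted(sc.getConf().getAll()):
    print("%s = %s" % (key, value))

# defaultParallelism is another hint at how many cores Spark sees.
print("defaultParallelism = %d" % sc.defaultParallelism)
```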
Thanks for your reply.
I tried setting CORES_PER_WORKER=1 and SPARK_WORKER_INSTANCES=2, but it didn't work.
Do you know how to check whether 8 cores are really used when CORES_PER_WORKER=8?
In the Spark UI, I found this:
Worker Id | Address | State | Cores | Memory
-- | -- | -- | -- | --
worker-20170707104633-10.0.2.15-37428 | 10.0.2.15:37428 | ALIVE | 8 (8 Used) | 8.0 GB (8.0 GB Used)
But I don't know how to check whether these 8 cores are really used during training.
Thanks
I was thinking of launching a simple command like `top` (on Linux) while your job is running and checking the CPU utilization. This is quite rough, though.
From your description, it is likely Spark grabs 8 cores. Whether Caffe uses all of them is up to the Caffe implementation; CaffeOnSpark does not change Caffe's behavior.
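If `top` is too rough, a small sketch along these lines samples per-core utilization while training runs (it assumes the third-party psutil package is installed; this is not part of CaffeOnSpark):

```python
# Rough per-core CPU sampler; run it alongside the training job.
# Assumes psutil is installed (pip install psutil).
import psutil

def sample_cpu(seconds=30, interval=1.0):
    """Print per-core utilization. If only one core is busy,
    Caffe is probably not using the 8 cores Spark grabbed."""
    for _ in range(int(seconds / interval)):
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        print(" ".join("%5.1f%%" % p for p in per_core))

if __name__ == "__main__":
    sample_cpu()
```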
Thanks a lot! I changed to YARN mode (pseudo-distributed YARN on a single virtual machine) for my model. Now I have a new problem with a customized layer. I'm using a custom Python layer in Caffe:
```
layer {
  name: "noisydata"
  type: "Python"
  bottom: "data"
  top: "noisydata"
  top: "mask"
  include {
    phase: TEST
  }
  python_param {
    module: "noisyLayer"
    layer: "NoisyLayer"
    param_str: '{"hide_ration": 0.0}'
  }
}
```
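(For reference, a Caffe Python layer with this interface is usually a `caffe.Layer` subclass along the following lines. This is only an illustrative sketch: the masking logic is assumed, and only the interface, one bottom with two tops and a JSON `param_str`, comes from the prototxt above.)

```python
# noisyLayer.py -- illustrative sketch of a layer matching the prototxt
# above; the masking logic is an assumption, not the real implementation.
import json
import numpy as np
import caffe

class NoisyLayer(caffe.Layer):
    def setup(self, bottom, top):
        params = json.loads(self.param_str)
        self.hide_ration = params.get("hide_ration", 0.0)

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)  # noisydata
        top[1].reshape(*bottom[0].data.shape)  # mask

    def forward(self, bottom, top):
        # Zero out a random fraction of the input and emit the mask used.
        mask = np.random.rand(*bottom[0].data.shape) >= self.hide_ration
        top[0].data[...] = bottom[0].data * mask
        top[1].data[...] = mask

    def backward(self, top, propagate_down, bottom):
        pass  # the prototxt restricts this layer to the TEST phase
```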
I added its directory to PYTHONPATH and used it without problems in CaffeOnSpark standalone mode. But in YARN mode, I get an error saying the customized layer cannot be found:
```
I0718 13:57:33.694187 993 layer_factory.hpp:77] Creating layer data
I0718 13:57:33.694200 993 net.cpp:99] Creating Layer data
I0718 13:57:33.694206 993 net.cpp:407] data -> data
I0718 13:57:33.694228 993 net.cpp:407] data -> label
I0718 13:57:33.694249 993 cos_data_layer.cpp:46] CoSDataLayer Top #0 20 3705 (74100)
I0718 13:57:33.694254 993 cos_data_layer.cpp:46] CoSDataLayer Top #1 20 3705 (74100)
I0718 13:57:33.694257 993 net.cpp:149] Setting up data
I0718 13:57:33.694265 993 net.cpp:156] Top shape: 20 3705 (74100)
I0718 13:57:33.694273 993 net.cpp:156] Top shape: 20 3705 (74100)
I0718 13:57:33.694277 993 net.cpp:164] Memory required for data: 592800
I0718 13:57:33.694281 993 layer_factory.hpp:77] Creating layer noisydata
ImportError: No module named noisyLayer
terminate called after throwing an instance of 'boost::python::error_already_set'
```
I have also set paths in .bashrc (noisyLayer is in /home/user/Workspace/ModelsOnCaffe/CustomizedLayers):
```
export PYTHONPATH="/home/user/Workspace/ModelsOnCaffe/CustomizedLayers":$PYTHONPATH
export PYSPARK_PYTHON=${IPYTHON_ROOT}/bin/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=${IPYTHON_ROOT}/bin/python:$PYTHONPATH
```
And also in spark-defaults.conf:

```
spark.yarn.appMasterEnv.PYSPARK_PYTHON /home/user/anaconda2/bin/python:/home/yunzhi/Workspace/ModelsOnCaffe/CustomizedLayers
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON /home/user/anaconda2/bin/python:/home/yunzhi/Workspace/ModelsOnCaffe/CustomizedLayers
```
Could you help me get the customized layer working, and explain how CaffeOnSpark handles this case (particularly in a real YARN cluster)?
I have no experience with customized Python layers, but the error message is quite clear: Python cannot import your noise layer because it does not know where to find it. I suppose you need to ship the .py file to the executors and place it in the right directory.
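If it helps, here is a rough PySpark sketch of shipping the file to the executors and checking that a plain Python import works there (addFile and SparkFiles are standard Spark APIs; whether Caffe's embedded interpreter also picks the module up is an assumption you would need to verify):

```python
# Sketch: ship noisyLayer.py to every executor with addFile(), then
# check that a plain "import noisyLayer" succeeds on the executors.
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="check-noisyLayer-import")
sc.addFile("/home/user/Workspace/ModelsOnCaffe/CustomizedLayers/noisyLayer.py")

def try_import(_):
    import sys
    # addFile() places shipped files under SparkFiles' root directory.
    sys.path.insert(0, SparkFiles.getRootDirectory())
    import noisyLayer  # noqa: F401
    return "ok"

print(sc.parallelize(range(2), 2).map(try_import).collect())
```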
Thanks for your prompt reply! Since I'm running the model on a pseudo-distributed YARN cluster (on a single machine), I think I don't need to ship the .py file, am I right? It worked in standalone mode, which means CaffeOnSpark did find it with the current path settings there; but in YARN mode it couldn't find it. I really don't know how to resolve this problem. Do you have any ideas? Thanks a lot, Junshi
Sorry, I do not know how to solve your problem.
Hi! I'm using a single machine (4 cores, 10 GB memory) to test CaffeOnSpark.
I'm trying to use 2 workers to train my model in standalone CPU mode, with the following settings:
But I get an error in the console when I call caffeonspark.train(train_data):
Could you tell me how to correctly set the number of executors?
PS: with these settings, I can successfully run normal Spark operations (creating DataFrames, manipulating RDDs, ...), and both workers do work.
Thank you!