yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

How to run CaffeOnSpark with a pre-existing model? #265

Open lakshya97 opened 7 years ago

lakshya97 commented 7 years ago

Hi,

I'm looking for a way to separate the training and testing phases in CaffeOnSpark. In other words, I'd like to create an MNIST model, train it in one phase, and then test it in another (saving that model so it can be tested with different data). Is it possible to do this without interleaving the data (as is done in the wiki example)? For example, first I would train and save the model without testing anything. Then I could use that existing model (without training a new one on the same training data all over again) on multiple different test datasets.

Is there a way to do this? Additionally, regardless of the separation of the phases, is there a way to use an existing/trained CaffeOnSpark model on new data (instead of creating an entirely new model on training data each time you wish to run it)? How could I do this/what commands do I need to modify?

Thanks!

junshi15 commented 7 years ago

Yes, you can test with an existing model. Starting from the EC2 example (https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2), you just need to remove the "-train -persistent" options:

```
# Keep the trained model in place; only clear the previous test output.
hadoop fs -rm -r -f /cifar10_test_result

spark-submit --master ${MASTER_URL} \
    --files cifar10_quick_solver.prototxt,cifar10_quick_train_test.prototxt,mean.binaryproto \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -test \
        -conf cifar10_quick_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices ${DEVICES} \
        -connection ethernet \
        -model /cifar10.model.h5 \
        -output /cifar10_test_result

hadoop fs -ls /cifar10.model.h5
hadoop fs -cat /cifar10_test_result
```

lakshya97 commented 7 years ago

What about for LMDB on YARN? I imagine it would be similar, but we would remove the lines reading "-train \" and "-features accuracy,loss -label label \" and replace them with just "-test", right?

Are there any other files we would need to modify? Also, where would we store the model (is there a need to remove it from Hadoop?), and how would we tell CaffeOnSpark to read from that model instead of generating a new one? I thought we would remove the line reading "hadoop fs -rm -f hdfs:///mnist.model" if we want it to read from the existing model stored in HDFS, but is this wrong?

Thank you!! (below is what I'd imagine it should look like?)

```
hadoop fs -rm -r -f hdfs:///mnist_features_result

spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -test \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result

hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*
```

junshi15 commented 7 years ago

"LMDB" is a data format, to use it, you need change "source_class" in lenet_memory_train_test.prototxt. We do not recommend "LMDB" for large data set since it is not a distributed data format.

"hadoop fs -rm" deletes the file/directory, if you don't want to delete it, don't do it. Note the job will fail if your program writes to an existing directory, since overwriting is not allowed.

Only "-train" generates new model. "-test" and "-features" read the provided model. Don't delete the existing model if you use either "-test" or "-features" since it won't be able to read it.

lakshya97 commented 7 years ago

Thank you, I did all that and it runs fine now :). One other question: you said LMDB is not a distributed data format, but Spark still partitions the work across the workers, so we can still use it for distributed learning, right? I am finding that when I use 3 nodes for a ~1GB LMDB file, it is much faster than using 1 node (since I keep the batch size the same at 64, I get 3x the throughput per iteration and thus would need 1/3 of the original number of iterations). Am I wrong?

Thank you

junshi15 commented 7 years ago

CaffeOnSpark will copy the entire LMDB file to all executors, since we cannot really partition it without reading it first, as opposed to a DataFrame or sequence file, where you can read part of the file.

https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/LmdbRDD.scala#L43

Spark does partition the file afterwards, so each executor only processes its own partitions. You effectively used a 3x batch size, so you may want to check your accuracy; sometimes you need to tweak the learning rate, and you may need a little more than 1/3 of the original total number of iterations.
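To make that concrete (back-of-the-envelope only, with N standing in for however many training samples your ~1GB LMDB holds):

\[
\text{effective batch} = 64 \times 3 = 192,
\qquad
\text{iterations per epoch} = \frac{N}{192} = \frac{1}{3}\cdot\frac{N}{64}
\]

So covering the same number of samples takes about a third of the iterations; since each update now averages over a larger effective batch, the learning rate may need retuning and the total iteration count may end up a bit above that one-third estimate.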