yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Trouble in performing test with existing model, dataframe empty. #287

Closed Marcteen closed 6 years ago

Marcteen commented 6 years ago

I made a sequence file with 15 images and tried to use a model file to get the model's output, but I am now stuck with the exception below:

[screenshot of the exception stack trace]

I found the line in CaffeOnSpark.scala that triggers the exception:

val n: Int = testDF.take(1)(0).getSeq[Double](index).size
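The crash follows directly from the DataFrame being empty: `take(1)` on an empty collection returns an empty sequence, so indexing it with `(0)` throws. A minimal plain-Scala sketch of the failure mode (no Spark required; `emptyRows` is a stand-in for the empty `testDF`):

```scala
object EmptyTakeDemo {
  def main(args: Array[String]): Unit = {
    // stand-in for an empty DataFrame of rows of Seq[Double]
    val emptyRows = Seq.empty[Seq[Double]]

    // take(1) on an empty collection yields an empty Seq...
    val taken = emptyRows.take(1)
    println(taken.isEmpty) // prints "true"

    // ...so indexing element 0 throws, just like the line in CaffeOnSpark.scala
    val threw =
      try { taken(0); false }
      catch { case _: IndexOutOfBoundsException => true }
    println(threw) // prints "true"
  }
}
```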

in

test[T1, T2](source: DataSource[T1, T2]): Map[String, Seq[Double]]

function. I then checked the input DataFrame of the function above, and it is empty. So I turned to

features2[T1, T2](source: DataSource[T1, T2]): DataFrame

which generates the DataFrame. I noticed that

val srcDataRDD = source.makeRDD(sc)

works fine, and its count() outputs 15 (the correct size of my dataset), but the

featureRDD

generated from it comes out empty, and I can't figure out why. I added the data layer like:

layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include { phase: TEST }
  source_class: "com.yahoo.ml.caffe.SeqImageDataSource"
  memory_data_param {
    source: "hdfs:///user/tseg/landmark/seqCropFaces"
    batch_size: 3
    channels: 1
    height: 60
    width: 60
    share_in_parallel: false
  }
}

and the solver file goes like:

net: "landmark_deploy.prototxt"
type: "Adam"
test_iter: 30
test_interval: 5000
base_lr: 0.000001
momentum: 0.9
momentum2: 0.999
lr_policy: "fixed"
gamma: 0.8
stepsize: 100000
display: 2500
max_iter: 1500000
snapshot: 5000
snapshot_prefix: "landmark-snap"
solver_mode: CPU

The submit script goes like:

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 1 \
  --files /home/tseg/user/lc/landmark68/adam_solver.prototxt,/home/tseg/user/lc/landmark68/landmark_deploy.prototxt \
  --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
  --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -test \
  -conf adam_solver.prototxt \
  -connection ethernet \
  -model hdfs:///user/tseg/landmark/VanFace.caffemodel \
  -output hdfs:///user/tseg/landmark_result

My clusters work fine with the provided demo. Any help would be appreciated.

junshi15 commented 6 years ago

In landmark_deploy.prototxt, set batch_size = 1. Also, in adam_solver.prototxt, set max_iter: 15.
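One plausible reading of this suggestion (an assumption, not confirmed in the thread): if each test iteration consumes one batch, the test pass reads batch_size × iterations samples, and that product should match the dataset size. With the suggested values, 1 × 15 = 15 samples, exactly the 15 images in the sequence file. A tiny sketch of that arithmetic, with `samplesConsumed` as a hypothetical helper:

```scala
object BatchCoverage {
  // hypothetical helper: samples read by a test pass, assuming
  // each iteration consumes exactly one batch
  def samplesConsumed(batchSize: Int, iterations: Int): Int =
    batchSize * iterations

  def main(args: Array[String]): Unit = {
    val datasetSize = 15
    // suggested settings: batch_size = 1, max_iter = 15
    println(samplesConsumed(1, 15) == datasetSize) // prints "true"
  }
}
```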

Marcteen commented 6 years ago

@junshi15 Thanks!