yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Questions about transform: some images appear to be lost after transform (source code modified for R-FCN) #272

Closed by Zzmc 6 years ago

Zzmc commented 6 years ago

Hi everyone. To run CaffeOnSpark, I modified some code in net.cpp so that I can see which image is sent to the Convolution layer, but I have run into a problem: it looks like some images are lost after this transform step:

    case CoSDataParameter.DataType.RAW_IMAGE | CoSDataParameter.DataType.ENCODED_IMAGE | CoSDataParameter.DataType.ENCODED_IMAGE_WITH_DIM => {
      if (transformers(i) != null) {
        transformers(i).transform(dataArray(i).asInstanceOf[MatVector], data(i))
      }

For example, when I run MNIST in ordinary Caffe, the image numbers are 0,1,2,3,4,5,6,7,8,9 and the labels are also 0,1,2,3,4,5,6,7,8,9. But when I run CaffeOnSpark, the image numbers are 2,3,4,5,6,7,8,9,0 while the labels are 0,1,2,3,4,5,6,7,8,9. (I modified the source code because I want to run R-FCN on CaffeOnSpark. I have solved most of the problems, but this one confuses me.) Can you give me some suggestions? Thank you!

junshi15 commented 6 years ago

Check your input data frame first: are the images and labels aligned?
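
One way to run the check suggested here, outside Spark, is to log the image ids and labels in the order they are seen and test whether a constant offset explains the mismatch. The following is a minimal sketch in plain Scala, not CaffeOnSpark code: `AlignmentCheck` and `findShift` are hypothetical names, and the sample values are the sequences reported in the question above.

```scala
object AlignmentCheck {
  // Hypothetical helper: returns the smallest cyclic shift k such that
  // observed(i) == expected((i + k) % expected.length) for every i,
  // or None if no constant shift explains the observed order.
  def findShift(expected: Seq[Int], observed: Seq[Int]): Option[Int] = {
    val exp = expected.toIndexedSeq
    val obs = observed.toIndexedSeq
    if (exp.isEmpty) None
    else (0 until exp.length).find { k =>
      obs.indices.forall(i => obs(i) == exp((i + k) % exp.length))
    }
  }

  def main(args: Array[String]): Unit = {
    val labels = 0 to 9                          // label order from the report
    val images = Seq(2, 3, 4, 5, 6, 7, 8, 9, 0)  // image order from the report
    findShift(labels, images) match {
      case Some(0) => println("images and labels are aligned")
      case Some(k) => println(s"images lead labels by $k position(s)")
      case None    => println("no constant shift explains the mismatch")
    }
  }
}
```

A constant non-zero shift would suggest that a fixed number of samples is being consumed somewhere before the layer under inspection, whereas `None` would point to a different kind of reordering.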

Zzmc commented 6 years ago

@junshi15 Thanks for your answer. I checked the images, and I can confirm that the images and labels are aligned. I found that when I set both test_iter and test_interval to 1, the images go wrong: 2 images are lost at the start. When I then set test_iter to 1 and test_interval to 10, the images are correct at first; after 10 images, when a test image is needed, it goes wrong again, and one image and its label do not match. After this one test image, the images and labels are aligned again (in other words, the image and label go wrong whenever a test image is involved). Do you know what's wrong?

junshi15 commented 6 years ago

Not sure about the validation. What happens if you set test_iter and test_interval to zero, which will disable the validation path?

Zzmc commented 6 years ago

@junshi15 I tried setting test_iter and test_interval to zero, but it does not run successfully. The error is:

    java.lang.UnsupportedOperationException: empty.reduceLeft
        at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
        at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
        at scala.collection.AbstractIterator.reduce(Iterator.scala:1336)
        at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$8.apply(CaffeOnSpark.scala:226)
        at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$8.apply(CaffeOnSpark.scala:213)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The failing code is CaffeOnSpark.scala, lines 204-227.

My net.prototxt is:

    layer {
      name: "data"
      type: "DFAnnoData"
      top: "data"
      top: "gt_boxes"
      include { phase: TRAIN }
      source_class: "com.yahoo.ml.caffe.AnnoDataFrameSource"
      data_param {
        source: "file:/home/cos/R-FCN/0814_lmdb_DF"
        batch_size: 1
      }
      augment_param {
        mirror: true
        mean_value: 104
        mean_value: 117
        mean_value: 123
        resize_param {
          prob: 1
          resize_mode: WARP
          height: 1000
          width: 1778
          interp_mode: LINEAR
          interp_mode: AREA
          interp_mode: NEAREST
          interp_mode: CUBIC
          interp_mode: LANCZOS4
        }
      }
    }
    layer {
      name: "data"
      type: "DFAnnoData"
      top: "data"
      top: "gt_boxes"
      include { phase: TEST }
      source_class: "com.yahoo.ml.caffe.AnnoDataFrameSource"
      data_param {
        source: "file:/home/cos/R-FCN/0814_lmdb_DF"
        batch_size: 1
      }
      augment_param {
        mirror: true
        mean_value: 104
        mean_value: 117
        mean_value: 123
        resize_param {
          prob: 1
          resize_mode: WARP
          height: 1000
          width: 1778
          interp_mode: LINEAR
          interp_mode: AREA
          interp_mode: NEAREST
          interp_mode: CUBIC
          interp_mode: LANCZOS4
        }
      }
    }

My solver.prototxt is:

    net: "/home/cos/R-FCN/R-FCN_train_DF_5.prototxt"
    test_iter: 0
    test_interval: 0
    base_lr: 0.001
    lr_policy: "step"
    gamma: 0.01
    stepsize: 16
    display: 1
    momentum: 0.9
    weight_decay: 0.0005
    max_iter: 40
    snapshot: 50
    snapshot_prefix: "R-FCN"
    iter_size: 1
    solver_mode: GPU

Do you know how to solve this problem? Thanks.
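
A note on the exception reported above: `empty.reduceLeft` is the message Scala's collections raise when `reduce` is called on an empty iterator or collection, and the trace shows such a `reduce` inside a `mapPartitions` call at CaffeOnSpark.scala:226. Why the iterator is empty there is not established in this thread; that it is related to validation being disabled is only an assumption. The sketch below reproduces the behaviour in plain Scala and shows `reduceOption` as one possible guard. `EmptyReduce` and `safeSum` are hypothetical names, not part of CaffeOnSpark.

```scala
import scala.util.Try

object EmptyReduce {
  // Hypothetical helper: sums the values, returning None when there is nothing to sum.
  def safeSum(values: Iterator[Float]): Option[Float] =
    values.reduceOption(_ + _)

  def main(args: Array[String]): Unit = {
    // reduce on an empty iterator throws UnsupportedOperationException.
    val attempt = Try(Seq.empty[Float].iterator.reduce(_ + _))
    println(s"reduce on an empty iterator failed: ${attempt.isFailure}")

    // reduceOption returns None instead, so the caller can decide what to do.
    println(safeSum(Seq.empty[Float].iterator))
    println(safeSum(Iterator(1.0f, 2.0f)))
  }
}
```

If the empty partition turns out to be legitimate for a given job, a guard of this kind avoids the crash; if it is not, the exception is a useful signal that no data reached that stage.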