yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

"java.lang.UnsupportedOperationException: empty.reduceLeft" error causes the test to fail #246

Open mollyStark opened 7 years ago

mollyStark commented 7 years ago

Hi, I hit the error "java.lang.UnsupportedOperationException: empty.reduceLeft". I found that #61 asked about this error, but I don't think the two have the same cause.

In #61, the error was caused by an empty input dataframe (an incorrect source file). I tried the same data source with fewer rows (just 1 row) in the dataframe, and the test succeeded. So the error seems to have nothing to do with the dataframe's input location, but rather with the dataframe's length. How weird!
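
For context, this exception comes from plain Scala collections, not from Spark itself: reduce on an empty collection delegates to reduceLeft, which has nothing to fold. A minimal sketch:

```scala
// Plain Scala, no Spark required: reduce on an empty collection
// delegates to reduceLeft, which throws on zero elements.
object EmptyReduceDemo {
  def main(args: Array[String]): Unit = {
    val empty = Seq.empty[Int]
    empty.reduce(_ + _) // java.lang.UnsupportedOperationException: empty.reduceLeft
  }
}
```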

The complete error message is below:

17/04/10 02:29:18 INFO datasources.FileScanRDD Executor task launch worker-0: Reading File path: hdfs:///home/xxx/resultdata/test_one/part-r-00000-f3d5cef3-9212-49e5-8d95-00801452d61f.gz.parquet, range: 0-7958646, partition values: [empty row]
17/04/10 02:29:18 INFO broadcast.TorrentBroadcast Executor task launch worker-0: Started reading broadcast variable 1
17/04/10 02:29:18 INFO memory.MemoryStore Executor task launch worker-0: Block broadcast_1_piece0 stored as bytes in memory (estimated size 25.8 KB, free 408.8 MB)
17/04/10 02:29:18 INFO broadcast.TorrentBroadcast Executor task launch worker-0: Reading broadcast variable 1 took 12 ms
17/04/10 02:29:18 INFO memory.MemoryStore Executor task launch worker-0: Block broadcast_1 stored as values in memory (estimated size 322.3 KB, free 408.5 MB)
17/04/10 02:29:18 INFO zlib.ZlibFactory Executor task launch worker-0: Successfully loaded & initialized native-zlib library
17/04/10 02:29:18 INFO compress.CodecPool Executor task launch worker-0: Got brand-new decompressor [.gz]
I0410 02:29:19.128298   337 CaffeNet.cpp:643] Test only
I0410 02:29:19.128367   337 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
17/04/10 02:29:21 INFO caffe.ImageDataFrame ForkJoinPool-1-worker-2: Completed all files
17/04/10 02:29:21 INFO codegen.CodeGenerator Executor task launch worker-0: Code generated in 20.388044 ms
17/04/10 02:29:21 INFO codegen.CodeGenerator Executor task launch worker-0: Code generated in 19.527992 ms
17/04/10 02:29:21 INFO codegen.CodeGenerator Executor task launch worker-0: Code generated in 6.525526 ms
17/04/10 02:29:21 INFO codegen.CodeGenerator Executor task launch worker-0: Code generated in 22.044114 ms
17/04/10 02:29:22 INFO codegen.CodeGenerator Executor task launch worker-0: Code generated in 10.760486 ms
17/04/10 02:29:22 INFO executor.Executor Executor task launch worker-0: Finished task 0.0 in stage 2.0 (TID 2). 3813 bytes result sent to driver
17/04/10 02:29:22 INFO executor.CoarseGrainedExecutorBackend dispatcher-event-loop-14: Got assigned task 3
17/04/10 02:29:22 INFO executor.Executor Executor task launch worker-0: Running task 1.0 in stage 2.0 (TID 3)
17/04/10 02:29:22 INFO datasources.FileScanRDD Executor task launch worker-0: Reading File path: hdfs:///home/xxx/resultdata/test_one/part-r-00000-f3d5cef3-9212-49e5-8d95-00801452d61f.gz.parquet, range: 7958646-11722988, partition values: [empty row]
I0410 02:29:22.095670   337 CaffeNet.cpp:643] Test only
I0410 02:29:22.095711   337 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
17/04/10 02:29:22 WARN storage.BlockManager Executor task launch worker-0: Putting block rdd_12_1 failed
17/04/10 02:29:22 ERROR executor.Executor Executor task launch worker-0: Exception in task 1.0 in stage 2.0 (TID 3)
java.lang.UnsupportedOperationException: empty.reduceLeft
        at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
        at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
        at scala.collection.AbstractIterator.reduce(Iterator.scala:1336)
        at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$18.apply(CaffeOnSpark.scala:492)
        at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$18.apply(CaffeOnSpark.scala:484)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        ...

We can see that the dataframe is read in two parts. The first part (range 0-7958646) seems to be tested successfully; the failing part is range 7958646-11722988, and there is a warning message "WARN storage.BlockManager Executor task launch worker-0: Putting block rdd_12_1 failed" right before the exception. So I'm wondering: is this empty.reduceLeft error associated with that warning?
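
One hypothesis consistent with the log: Spark 2.0 splits the parquet file into two byte ranges, and if every row belongs to the first split (for example, when the file has a single row group), the second task gets an empty partition iterator. Calling reduce on it, as the mapPartitions frame at CaffeOnSpark.scala:492 in the stack trace suggests, would produce exactly this exception. Below is a minimal Spark sketch of that failing pattern, not CaffeOnSpark's actual code:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (not CaffeOnSpark's actual code): reducing each
// partition's iterator fails as soon as one partition is empty.
object EmptyPartitionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("repro").getOrCreate()
    // 1 element spread over 2 partitions => one partition is empty.
    val rdd = spark.sparkContext.parallelize(Seq(1), numSlices = 2)

    // Throws java.lang.UnsupportedOperationException: empty.reduceLeft:
    // rdd.mapPartitions(it => Iterator(it.reduce(_ + _))).collect()

    // reduceOption simply yields nothing for the empty partition:
    val sums = rdd.mapPartitions(it => it.reduceOption(_ + _).iterator).collect()
    println(sums.mkString(",")) // prints "1"
    spark.stop()
  }
}
```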

Some more information about this dataframe: it contains about 15 MB of data and 100 rows.

Please help me solve this problem; I've been stuck on it for weeks. Thank you!

junshi15 commented 7 years ago

How many executors were you using? For debugging purposes, you may want to use a single executor (if you are not already). Also, please verify that your dataframe is correct.
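
As an illustration of the second suggestion, a hedged spark-shell sketch (Spark 2.x) of one way to check the dataframe, counting the rows that land in each partition; the path is the one from the executor log above:

```scala
// spark-shell (Spark 2.x) sketch: count the rows in each partition of
// the input dataframe, to see whether one of the two file splits from
// the log ends up empty.
val df = spark.read.parquet(
  "hdfs:///home/xxx/resultdata/test_one/part-r-00000-f3d5cef3-9212-49e5-8d95-00801452d61f.gz.parquet")
val counts = df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
counts.foreach { case (idx, n) => println(s"partition $idx: $n rows") }
```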

mollyStark commented 7 years ago

@junshi15 I used 1 executor by setting --num-executors to 1 and still hit this problem; the log is the same as above, with the file read in two parts. I also checked the same dataframe by running the job on my local standalone machine, and there was no problem. The only difference between my local machine and the cluster is the Spark version: 1.5.2 locally versus 2.0.0 on the cluster. Also, I read the dataframe (the parquet file) from the command line, and the data has not been damaged.
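
For reference, a spark-shell sketch of the kind of command-line check described above (Spark 2.x API; on the 1.5.2 machine the equivalent entry point would be sqlContext.read.parquet):

```scala
// spark-shell (Spark 2.x) sketch: verify the parquet data is intact.
val df = spark.read.parquet(
  "hdfs:///home/xxx/resultdata/test_one/part-r-00000-f3d5cef3-9212-49e5-8d95-00801452d61f.gz.parquet")
println(df.count()) // expect 100, per the row count mentioned earlier
df.show(5)          // eyeball a few rows for obvious corruption
```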