It appears that the lmdb file was read incorrectly.
I have the same error when I run the MNIST dataset example, and I don't know how to solve this problem. I followed the steps at https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn exactly, and the CIFAR-10 dataset runs without any problem.
It's fixed. The lmdb path had to be changed in lenet_memory_train_test.prototxt.
New issue: 17/03/30 20:00:46 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 6, u10-121-135-150): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 216.5 GB of 8.4 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Why does Spark request so much memory?
What were you running? The LeNet example? I have never seen LeNet take so much memory.
@junshi15, yes, I use the spark command:
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result
Your script seems to be correct; I am not sure where the problem is. For debugging purposes, you can start with "--num-executors 1". But MNIST is such a small dataset that a couple of GB of memory should be more than enough.
I tried, but I still get the same error. Any other advice? Thanks.
No more ideas.
There are some other errors:
17/03/31 14:40:07 INFO executor.Executor: Finished task 0.3 in stage 2.0 (TID 5). 1697 bytes result sent to driver
17/03/31 14:40:07 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
17/03/31 14:40:07 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 6)
17/03/31 14:40:07 INFO spark.MapOutputTrackerWorker: Updating epoch to 3 and clearing cache
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1559.0 B, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 9 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.5 KB, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 73.0 B, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 5 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 112.0 B, free 408.9 MB)
I0331 14:40:07.717497 42426 common.cpp:61] 0-th string is NULL
I0331 14:40:07.746775 42426 parallel.cpp:392] GPUs pairs
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Interleave enabled
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Start transformer for train in CaffeProcessor StartThreads
I0331 14:40:07.752315 42443 CaffeNet.cpp:633] Interleaved
I0331 14:40:07.752343 42443 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0331 14:40:07.752353 42443 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Start transformer for validation in CaffeProcessor StartThreads
17/03/31 14:40:07 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 6). 1633 bytes result sent to driver
17/03/31 14:40:07 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/03/31 14:40:07 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 7)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.2 KB, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 11 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.3 KB, free 408.9 MB)
17/03/31 14:40:07 INFO caffe.LmdbRDD: Processing partition 0
17/03/31 14:40:07 INFO caffe.LmdbRDD: local LMDB path:/home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/container_1490878649808_0006_01_000005/./mnist_train_lmdb
17/03/31 14:40:08 INFO caffe.LmdbRDD: Completed partition 0
17/03/31 14:40:08 WARN executor.Executor: 1 block locks were not released by TID = 7:
[rdd_1_0]
17/03/31 14:40:08 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 7). 2499 bytes result sent to driver
17/03/31 14:40:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
17/03/31 14:40:08 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 8)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1267.0 B, free 408.9 MB)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6 took 6 ms
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2008.0 B, free 408.9 MB)
17/03/31 14:40:08 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 8). 1539 bytes result sent to driver
17/03/31 14:40:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
17/03/31 14:40:08 INFO executor.Executor: Running task 0.0 in stage 6.0 (TID 9)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 7
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.9 KB, free 408.9 MB)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 7 took 6 ms
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 4.6 KB, free 408.9 MB)
17/03/31 14:40:08 INFO storage.BlockManager: Found block rdd_1_0 locally
17/03/31 14:40:08 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/03/31 14:40:08 INFO storage.DiskBlockManager: Shutdown hook called
17/03/31 14:40:08 INFO util.ShutdownHookManager: Shutdown hook called
17/03/31 14:40:08 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/spark-c3bfa4ef-559a-4464-988b-94213861c50f
17/03/31 14:40:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/09/temp_shuffle_afe02d52-6090-4f3f-be2e-6675591458ab
java.io.FileNotFoundException: /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/09/temp_shuffle_afe02d52-6090-4f3f-be2e-6675591458ab (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.
Fixed by setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml. I'm still confused about how the virtual memory is calculated, though. Thanks all.
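For anyone who hits the same limit, this is roughly what the change looks like (a minimal sketch; it goes into yarn-site.xml on every NodeManager, which then needs a restart, and the ratio value in the second property is only an example):

<!-- disable YARN's virtual-memory check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- softer alternative: keep the check but allow more virtual memory per unit of physical memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>5</value>
</property>

As for how the virtual memory is calculated: YARN kills a container when the virtual address space of its process tree exceeds the container's physical memory allocation times yarn.nodemanager.vmem-pmem-ratio (2.1 by default). A JVM that loads large native libraries, as CaffeOnSpark does, can reserve far more virtual address space than the RAM it actually touches, which is why the reported number looks absurd and why disabling the check (or raising the ratio) is the usual workaround.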
I'm getting a similar error when executing the mnist-yarn tutorial:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, HOSTNAME.com): java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:184)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:201)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.<init>(LmdbRDD.scala:101)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I already fixed the lmdb path in lenet_memory_train_test.prototxt, but the error still appears.
I'm just wondering: CaffeOnSpark/data/lenet_cos_train_test.prototxt contains a reference to hdfs:///projects/machine_learning/image_dataset/mnist_train_dataframe/, but the dataset uploaded to HDFS is named mnist_train_lmdb (same for test). I renamed both, but no change. Could this be connected to my problem? I don't even know whether lenet_cos_train_test.prototxt belongs to this example, since it's not included in the spark-submit command, but it's just an idea.
Can anyone help?
Finally I found the reason for my problem. Changing the path in lenet_memory_train_test.prototxt was a good hint: the "source" keys have to point to an hdfs:/// path. A file:// path, as in the preset prototxt, causes a NullPointerException. Easy when you know it, but I don't think it's documented anywhere.
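For reference, the change goes in the memory_data_param of the data layers (TRAIN and TEST phases) in lenet_memory_train_test.prototxt. A rough sketch from memory, so field names and values other than source may differ slightly from the shipped file, and the HDFS path below is only an example pointing at wherever you uploaded mnist_train_lmdb:

# data layer in lenet_memory_train_test.prototxt (make the same change for the TEST phase)
memory_data_param {
  # shipped as something like: source: "file:///.../mnist_train_lmdb/"
  # a file:// URI is what triggers the NullPointerException on YARN
  source: "hdfs:///user/hadoop/mnist_train_lmdb/"
  batch_size: 64
}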
The wiki definitely needs an update (that's not the only mistake in there). Maybe I'll open a new issue for that.
@BlueRayONE thanks for your feedback. If you can open an issue to detail the errors in the wiki, I will have them fixed.
Increase the driver memory and it will be resolved. Pass this as a parameter...
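The comment doesn't say which parameter is meant; presumably it's spark-submit's standard --driver-memory flag (the 4g below is only an example value), added to the command shown earlier:

spark-submit --master yarn --deploy-mode cluster \
    --driver-memory 4g \
    ...   # rest of the command unchanged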
Running the example in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn, the error below occurs. I am using Spark 2.0.0 and Hadoop 2.7.1.
17/03/23 14:19:11 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:184)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:201)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:45)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:257)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:42)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/03/23 14:19:11 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
17/03/23 14:19:11 INFO spark.SparkContext: Invoking stop() from shutdown hook