yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

java.lang.NullPointerException in DNN network training #239

Closed apli closed 7 years ago

apli commented 7 years ago

Running the example in https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn, the error below occurs. I am using Spark 2.0.0 and Hadoop 2.7.1.

17/03/23 14:19:11 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:184)
    at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:201)
    at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:45)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
    at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:257)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:42)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/03/23 14:19:11 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
17/03/23 14:19:11 INFO spark.SparkContext: Invoking stop() from shutdown hook

junshi15 commented 7 years ago

It appears that the lmdb file was read incorrectly.

yansmallb commented 7 years ago

I get the same error when I run the MNIST example, and I don't know how to solve it. I also followed https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn step by step. The CIFAR-10 example runs without any problem.

apli commented 7 years ago

It's fixed. The LMDB path has to be changed in lenet_memory_train_test.prototxt.
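
For anyone hitting the same thing, here is a minimal sketch of the kind of data layer that holds this path in lenet_memory_train_test.prototxt; the exact layer fields and the HDFS location below are assumptions, so point "source" at wherever your LMDB actually lives:

    # hypothetical training data layer -- only the "source" value matters here
    layer {
      name: "data"
      type: "MemoryData"
      source_class: "com.yahoo.ml.caffe.LMDB"
      memory_data_param {
        source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb"
        batch_size: 64
        channels: 1
        height: 28
        width: 28
      }
      include { phase: TRAIN }
    }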

apli commented 7 years ago

New issue:

17/03/30 20:00:46 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 6, u10-121-135-150): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 216.5 GB of 8.4 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Why does Spark request so much memory?
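
(The message itself suggests raising spark.yarn.executor.memoryOverhead; a minimal sketch of how that conf could be passed on submit, with the value being an assumption:)

    # sketch: raise the off-heap overhead YARN accounts for per executor (1024 MB is an assumed value)
    spark-submit ... \
        --conf spark.yarn.executor.memoryOverhead=1024 \
        ...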

junshi15 commented 7 years ago

What were you running? The lenet example? I have never seen lenet take so much memory.

apli commented 7 years ago

@junshi15, yes, I use this spark-submit command:

    spark-submit --master yarn --deploy-mode cluster \
        --num-executors ${SPARK_WORKER_INSTANCES} \
        --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
        --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
        --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
        --class com.yahoo.ml.caffe.CaffeOnSpark \
        ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result

junshi15 commented 7 years ago

Your script seems to be correct; I'm not sure where the problem is. For debugging purposes, you can start with "--num-executors 1". But MNIST is such a small dataset that a couple of GB of memory should be more than enough.

apli commented 7 years ago

I tried, but I still get the same error. Any other advice? Thanks.

junshi15 commented 7 years ago

no more ideas.

apli commented 7 years ago

There are some other errors:
17/03/31 14:40:07 INFO executor.Executor: Finished task 0.3 in stage 2.0 (TID 5). 1697 bytes result sent to driver
17/03/31 14:40:07 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
17/03/31 14:40:07 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 6)
17/03/31 14:40:07 INFO spark.MapOutputTrackerWorker: Updating epoch to 3 and clearing cache
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1559.0 B, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 9 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.5 KB, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 73.0 B, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 5 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 112.0 B, free 408.9 MB)
I0331 14:40:07.717497 42426 common.cpp:61] 0-th string is NULL
I0331 14:40:07.746775 42426 parallel.cpp:392] GPUs pairs
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Interleave enabled
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Start transformer for train in CaffeProcessor StartThreads
I0331 14:40:07.752315 42443 CaffeNet.cpp:633] Interleaved
I0331 14:40:07.752343 42443 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0331 14:40:07.752353 42443 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
17/03/31 14:40:07 INFO caffe.CaffeProcessor: Start transformer for validation in CaffeProcessor StartThreads
17/03/31 14:40:07 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 6). 1633 bytes result sent to driver
17/03/31 14:40:07 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/03/31 14:40:07 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 7)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.2 KB, free 408.9 MB)
17/03/31 14:40:07 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 11 ms
17/03/31 14:40:07 INFO memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.3 KB, free 408.9 MB)
17/03/31 14:40:07 INFO caffe.LmdbRDD: Processing partition 0
17/03/31 14:40:07 INFO caffe.LmdbRDD: local LMDB path:/home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/container_1490878649808_0006_01_000005/./mnist_train_lmdb
17/03/31 14:40:08 INFO caffe.LmdbRDD: Completed partition 0
17/03/31 14:40:08 WARN executor.Executor: 1 block locks were not released by TID = 7: [rdd_1_0]
17/03/31 14:40:08 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 7). 2499 bytes result sent to driver
17/03/31 14:40:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
17/03/31 14:40:08 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 8)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1267.0 B, free 408.9 MB)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6 took 6 ms
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2008.0 B, free 408.9 MB)
17/03/31 14:40:08 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 8). 1539 bytes result sent to driver
17/03/31 14:40:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
17/03/31 14:40:08 INFO executor.Executor: Running task 0.0 in stage 6.0 (TID 9)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 7
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.9 KB, free 408.9 MB)
17/03/31 14:40:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 7 took 6 ms
17/03/31 14:40:08 INFO memory.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 4.6 KB, free 408.9 MB)
17/03/31 14:40:08 INFO storage.BlockManager: Found block rdd_1_0 locally
17/03/31 14:40:08 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/03/31 14:40:08 INFO storage.DiskBlockManager: Shutdown hook called
17/03/31 14:40:08 INFO util.ShutdownHookManager: Shutdown hook called
17/03/31 14:40:08 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/spark-c3bfa4ef-559a-4464-988b-94213861c50f
17/03/31 14:40:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/09/temp_shuffle_afe02d52-6090-4f3f-be2e-6675591458ab
java.io.FileNotFoundException: /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/09/temp_shuffle_afe02d52-6090-4f3f-be2e-6675591458ab (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:228)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/03/31 14:40:08 ERROR sort.BypassMergeSortShuffleWriter: Error while deleting file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/09/temp_shuffle_afe02d52-6090-4f3f-be2e-6675591458ab
17/03/31 14:40:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/23/temp_shuffle_09163fbe-75db-43e5-a49c-d857ff9c053f
java.io.FileNotFoundException: /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/23/temp_shuffle_09163fbe-75db-43e5-a49c-d857ff9c053f (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:228)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/03/31 14:40:08 ERROR sort.BypassMergeSortShuffleWriter: Error while deleting file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/23/temp_shuffle_09163fbe-75db-43e5-a49c-d857ff9c053f
17/03/31 14:40:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/08/temp_shuffle_013dc28d-4216-49fe-80e1-e253f96c3ffe
java.io.FileNotFoundException: /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/08/temp_shuffle_013dc28d-4216-49fe-80e1-e253f96c3ffe (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:228)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/03/31 14:40:08 ERROR sort.BypassMergeSortShuffleWriter: Error while deleting file /home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1490878649808_0006/blockmgr-6b930ca5-8902-46df-8831-6bfdf0daa213/08/temp_shuffle_013dc28d-4216-49fe-80e1-e253f96c3ffe

apli commented 7 years ago

Fixed by setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml. I'm still confused about how the virtual memory is calculated, though. Thanks all.
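
For completeness, a minimal sketch of the corresponding yarn-site.xml entry (this only disables the NodeManager's virtual-memory check; it does not reduce actual memory usage):

    <!-- stop YARN from killing containers based on virtual-memory accounting -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>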

mumlax commented 6 years ago

I'm getting a similar error when executing the mnist-yarn tutorial:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, HOSTNAME.com): java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:184)
    at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:201)
    at com.yahoo.ml.caffe.LmdbRDD$$anon$1.<init>(LmdbRDD.scala:101)
    at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:99)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I already fixed the LMDB path in lenet_memory_train_test.prototxt, but the error still appears.

I'm just wondering: CaffeOnSpark/data/lenet_cos_train_test.prototxt references hdfs:///projects/machine_learning/image_dataset/mnist_train_dataframe/, but the file uploaded to HDFS is named mnist_train_lmdb (same for test). I renamed both, but no change. Could this be connected to my problem? I don't even know whether lenet_cos_train_test.prototxt belongs to the example, since it's not included in the spark-submit command, but it's just an idea.

Can anyone help?

mumlax commented 6 years ago

Finally I found the reason for my problem; changing a path in lenet_memory_train_test.prototxt was a good hint. The "source" keys have to point to an hdfs:/// path. A file:// path, as in the preset prototxt, causes a NullPointerException. Easy when you know it, but I think it's not noted anywhere.
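
Concretely, the change boils down to swapping the scheme on the "source" entries; the exact paths below are assumptions, so adjust them to your own layout:

    # before (preset prototxt) -- a local file:// path triggers the NullPointerException
    source: "file:///tmp/mnist_train_lmdb"
    # after -- point at the LMDB that was uploaded to HDFS
    source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb"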

The wiki definitely needs an update (that's not the only mistake in there). Maybe I'll open a new issue for that.

junshi15 commented 6 years ago

@BlueRayONE thanks for your feedback. If you can open an issue to detail the errors in the wiki, I will have them fixed.

abhishek881 commented 6 years ago

Increase the driver memory and it will be resolved. Pass this as a parameter...
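
(A sketch of how driver memory could be passed on spark-submit; the 8g value is an assumption:)

    # sketch: raise the driver heap at submit time (8g is an assumed value)
    spark-submit --master yarn --deploy-mode cluster \
        --driver-memory 8g \
        ...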