yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Executor may hang when using multiple devices (GPU) #243

Open heliumsun opened 7 years ago

heliumsun commented 7 years ago

I've run into an executor hang several times. It happens randomly when I submit several CaffeOnSpark tasks, each using 2 devices (GPUs).

Call stack:

2017-03-30 10:40:51
Full thread dump OpenJDK 64-Bit Server VM (25.65-b01 mixed mode):

"ForkJoinPool-1-worker-5" #128 daemon prio=5 os_prio=0 tid=0x0000120011960800 nid=0x1637b waiting on condition [0x00001000bbdfd000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"ForkJoinPool-1-worker-4" #127 daemon prio=5 os_prio=0 tid=0x000010013004a000 nid=0x1637a runnable [0x000010013920d000]
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.jcaffe.CaffeNet.train(Native Method)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$com$yahoo$ml$caffe$CaffeProcessor$$doTrain$2.apply$mcVI$sp(CaffeProcessor.scala:447)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$com$yahoo$ml$caffe$CaffeProcessor$$doTrain$2.apply(CaffeProcessor.scala:428)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$com$yahoo$ml$caffe$CaffeProcessor$$doTrain$2.apply(CaffeProcessor.scala:428)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.immutable.Range.foreach(Range.scala:141)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at com.yahoo.ml.caffe.CaffeProcessor.com$yahoo$ml$caffe$CaffeProcessor$$doTrain(CaffeProcessor.scala:428)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$1.apply$mcV$sp(CaffeProcessor.scala:145)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$1.apply(CaffeProcessor.scala:145)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$1.apply(CaffeProcessor.scala:145)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

"ForkJoinPool-1-worker-3" #126 daemon prio=5 os_prio=0 tid=0x0000120011960000 nid=0x16379 runnable [0x000010013940d000]
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.caffe.CaffeProcessor.takeFromQueue(CaffeProcessor.scala:232)
    at com.yahoo.ml.caffe.CaffeProcessor.com$yahoo$ml$caffe$CaffeProcessor$$doTransform(CaffeProcessor.scala:336)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$3$$anonfun$apply$mcZI$sp$2.apply$mcV$sp(CaffeProcessor.scala:159)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$3$$anonfun$apply$mcZI$sp$2.apply(CaffeProcessor.scala:159)
    at com.yahoo.ml.caffe.CaffeProcessor$$anonfun$startThreads$1$$anonfun$apply$mcVI$sp$3$$anonfun$apply$mcZI$sp$2.apply(CaffeProcessor.scala:159)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

"ForkJoinPool-1-worker-2" #125 daemon prio=5 os_prio=0 tid=0x000012001195f000 nid=0x16378 waiting on condition [0x0000100139a0d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"ForkJoinPool-1-worker-1" #124 daemon prio=5 os_prio=0 tid=0x000012001195e800 nid=0x16377 waiting on condition [0x00001000bb9fd000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"shuffle-client-0" #114 daemon prio=5 os_prio=0 tid=0x00001001a8095000 nid=0x160ed runnable [0x000010018f61d000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"Executor task launch worker-0" #118 daemon prio=5 os_prio=0 tid=0x000010011c07e000 nid=0x160e0 runnable [0x000010018f01c000]
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.caffe.CaffeProcessor.feedQueue(CaffeProcessor.scala:195)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$10$$anonfun$11.apply(CaffeOnSpark.scala:330)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$10$$anonfun$11.apply(CaffeOnSpark.scala:330)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
    at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
    at scala.collection.AbstractIterator.reduce(Iterator.scala:1157)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$10.apply(CaffeOnSpark.scala:330)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$10.apply(CaffeOnSpark.scala:325)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$21.apply(RDD.scala:730)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$21.apply(RDD.scala:730)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:309)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:273)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"driver-heartbeater" #117 daemon prio=5 os_prio=0 tid=0x000010011c07b000 nid=0x160d8 waiting on condition [0x000010018ee1d000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"shuffle-client-0" #116 daemon prio=5 os_prio=0 tid=0x000010011c075800 nid=0x160d2 runnable [0x000010018dc0d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"shuffle-server-0" #115 daemon prio=5 os_prio=0 tid=0x000010011c051800 nid=0x160d1 runnable [0x000010018da0d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"metrics-logger-reporter-1-thread-1" #113 daemon prio=5 os_prio=0 tid=0x0000100004c83000 nid=0x160cf waiting on condition [0x000010018d80d000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"BROADCAST_VARS cleanup timer" #112 daemon prio=5 os_prio=0 tid=0x0000100004bac000 nid=0x160ce in Object.wait() [0x000010018d40d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)

"BLOCK_MANAGER cleanup timer" #111 daemon prio=5 os_prio=0 tid=0x0000100004baa800 nid=0x160cd in Object.wait() [0x000010018d20d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)

"shuffle-client-0" #87 daemon prio=5 os_prio=0 tid=0x0000100180001000 nid=0x160cc runnable [0x000010013be0d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"netty-rpc-env-timeout" #109 daemon prio=5 os_prio=0 tid=0x0000100004b52800 nid=0x160cb waiting on condition [0x000010013bc0d000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"New I/O server boss #6" #107 daemon prio=5 os_prio=0 tid=0x0000100160123800 nid=0x160c9 runnable [0x000010013b60d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"New I/O worker #5" #106 daemon prio=5 os_prio=0 tid=0x000010016003c800 nid=0x160c8 runnable [0x000010013b40d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"New I/O worker #4" #105 daemon prio=5 os_prio=0 tid=0x00001001600a2800 nid=0x160c7 runnable [0x000010013b20d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"New I/O boss #3" #104 daemon prio=5 os_prio=0 tid=0x000010016003d800 nid=0x160c6 runnable [0x000010013b00d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"New I/O worker #2" #102 daemon prio=5 os_prio=0 tid=0x0000100160032000 nid=0x160c5 runnable [0x000010013ae0d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"New I/O worker #1" #101 daemon prio=5 os_prio=0 tid=0x0000100160030800 nid=0x160c4 runnable [0x000010013ac0d000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-6" #100 daemon prio=5 os_prio=0 tid=0x0000100004b35800 nid=0x160c3 waiting on condition [0x000010013aa0d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-5" #99 daemon prio=5 os_prio=0 tid=0x000010015800d000 nid=0x160c2 waiting on condition [0x000010013a80d000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"sparkExecutorActorSystem-akka.actor.default-dispatcher-4" #98 daemon prio=5 os_prio=0 tid=0x0000100158001000 nid=0x160c1 waiting on condition [0x000010013a40d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"sparkExecutorActorSystem-scheduler-1" #95 daemon prio=5 os_prio=0 tid=0x0000100004a83000 nid=0x160be waiting on condition [0x0000100139e0d000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at akka.actor.LightArrayRevolverScheduler.waitNanos(Scheduler.scala:226) at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:405) at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) at java.lang.Thread.run(Thread.java:745)

"dispatcher-event-loop-31" #86 daemon prio=5 os_prio=0 tid=0x000010000491c800 nid=0x160bd waiting on condition [0x00001000b8dfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-30" #85 daemon prio=5 os_prio=0 tid=0x000010000491b800 nid=0x160bc waiting on condition [0x00001000b8bfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-29" #84 daemon prio=5 os_prio=0 tid=0x000010000491a800 nid=0x160bb waiting on condition [0x00001000b89fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-28" #83 daemon prio=5 os_prio=0 tid=0x0000100004919000 nid=0x160ba waiting on condition [0x00001000b87fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-27" #82 daemon prio=5 os_prio=0 tid=0x0000100004918000 nid=0x160b9 waiting on condition [0x00001000b85fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-26" #81 daemon prio=5 os_prio=0 tid=0x0000100004916800 nid=0x160b8 waiting on condition [0x00001000b83fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-25" #80 daemon prio=5 os_prio=0 tid=0x0000100004910800 nid=0x160b7 waiting on condition [0x00001000b81fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-24" #79 daemon prio=5 os_prio=0 tid=0x000010000490f800 nid=0x160b6 waiting on condition [0x0000100077fbd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-23" #78 daemon prio=5 os_prio=0 tid=0x000010000474f000 nid=0x160b5 waiting on condition [0x0000100077dbd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-22" #77 daemon prio=5 os_prio=0 tid=0x0000100004760800 nid=0x160b4 waiting on condition [0x0000100077bbd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-21" #76 daemon prio=5 os_prio=0 tid=0x0000100004913800 nid=0x160b3 waiting on condition [0x00001000779bd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-20" #75 daemon prio=5 os_prio=0 tid=0x000010000475d000 nid=0x160b2 waiting on condition [0x0000100076f6d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-19" #74 daemon prio=5 os_prio=0 tid=0x000010000475e800 nid=0x160b1 waiting on condition [0x00001000b8ffd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-18" #73 daemon prio=5 os_prio=0 tid=0x000010000475b000 nid=0x160b0 waiting on condition [0x00001000b91fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-17" #72 daemon prio=5 os_prio=0 tid=0x0000100004759000 nid=0x160af waiting on condition [0x00001000b93fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-16" #71 daemon prio=5 os_prio=0 tid=0x0000100004766000 nid=0x160ae waiting on condition [0x00001000b95fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-15" #70 daemon prio=5 os_prio=0 tid=0x0000100004762000 nid=0x160ad waiting on condition [0x00001000b97fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-14" #69 daemon prio=5 os_prio=0 tid=0x0000100004764000 nid=0x160ac waiting on condition [0x00001000b99fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-13" #68 daemon prio=5 os_prio=0 tid=0x0000100004751000 nid=0x160ab waiting on condition [0x00001000b9bfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-12" #67 daemon prio=5 os_prio=0 tid=0x000010000478d000 nid=0x160aa waiting on condition [0x00001000b9dfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-11" #66 daemon prio=5 os_prio=0 tid=0x0000100004785000 nid=0x160a9 waiting on condition [0x00001000b9ffd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-10" #65 daemon prio=5 os_prio=0 tid=0x000010000478b000 nid=0x160a8 waiting on condition [0x00001000ba1fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-9" #64 daemon prio=5 os_prio=0 tid=0x000010000476a800 nid=0x160a7 waiting on condition [0x00001000ba3fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-8" #63 daemon prio=5 os_prio=0 tid=0x000010000476c800 nid=0x160a6 waiting on condition [0x00001000ba5fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-7" #62 daemon prio=5 os_prio=0 tid=0x0000100004770800 nid=0x160a5 waiting on condition [0x00001000ba7fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-6" #61 daemon prio=5 os_prio=0 tid=0x0000100004772800 nid=0x160a4 waiting on condition [0x00001000ba9fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-5" #60 daemon prio=5 os_prio=0 tid=0x0000100004774800 nid=0x160a3 waiting on condition [0x00001000babfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-4" #59 daemon prio=5 os_prio=0 tid=0x0000100004778800 nid=0x160a2 waiting on condition [0x00001000badfd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-3" #58 daemon prio=5 os_prio=0 tid=0x000010000477a800 nid=0x160a1 waiting on condition [0x00001000baffd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-2" #57 daemon prio=5 os_prio=0 tid=0x000010000477f000 nid=0x160a0 waiting on condition [0x00001000bb1fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-1" #56 daemon prio=5 os_prio=0 tid=0x0000100004781000 nid=0x1609f waiting on condition [0x00001000bb3fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"dispatcher-event-loop-0" #55 daemon prio=5 os_prio=0 tid=0x0000100004783000 nid=0x1609e waiting on condition [0x00001000bb5fd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"threadDeathWatcher-2-1" #54 daemon prio=1 os_prio=0 tid=0x0000100130011800 nid=0x16096 waiting on condition [0x00001000bbffd000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:137) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745)

"Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00001000040fc000 nid=0x1602e runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00001000040f1800 nid=0x16028 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00001000040e4000 nid=0x16027 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00001000040e2000 nid=0x16025 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00001000040b4800 nid=0x16022 in Object.wait() [0x0000100074d6d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00001000040b2800 nid=0x16021 in Object.wait() [0x0000100074b6d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)

"main" #1 prio=5 os_prio=0 tid=0x0000100004009800 nid=0x16008 waiting on condition [0x000010000135c000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

"VM Thread" os_prio=0 tid=0x00001000040ad000 nid=0x16020 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x000010000401e800 nid=0x16009 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x0000100004020000 nid=0x1600a runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x0000100004022000 nid=0x1600b runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x0000100004024000 nid=0x1600c runnable

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x0000100004025800 nid=0x1600d runnable

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x0000100004027800 nid=0x1600e runnable

"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x0000100004029800 nid=0x1600f runnable

"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x000010000402b000 nid=0x16010 runnable

"GC task thread#8 (ParallelGC)" os_prio=0 tid=0x000010000402d000 nid=0x16011 runnable

"GC task thread#9 (ParallelGC)" os_prio=0 tid=0x000010000402e800 nid=0x16012 runnable

"GC task thread#10 (ParallelGC)" os_prio=0 tid=0x0000100004030800 nid=0x16013 runnable

"GC task thread#11 (ParallelGC)" os_prio=0 tid=0x0000100004032800 nid=0x16014 runnable

"GC task thread#12 (ParallelGC)" os_prio=0 tid=0x0000100004034000 nid=0x16015 runnable

"GC task thread#13 (ParallelGC)" os_prio=0 tid=0x0000100004036000 nid=0x16016 runnable

"GC task thread#14 (ParallelGC)" os_prio=0 tid=0x0000100004038000 nid=0x16017 runnable

"GC task thread#15 (ParallelGC)" os_prio=0 tid=0x0000100004039800 nid=0x16018 runnable

"GC task thread#16 (ParallelGC)" os_prio=0 tid=0x000010000403b800 nid=0x16019 runnable

"GC task thread#17 (ParallelGC)" os_prio=0 tid=0x000010000403d000 nid=0x1601a runnable

"GC task thread#18 (ParallelGC)" os_prio=0 tid=0x000010000403f000 nid=0x1601b runnable

"GC task thread#19 (ParallelGC)" os_prio=0 tid=0x0000100004041000 nid=0x1601c runnable

"GC task thread#20 (ParallelGC)" os_prio=0 tid=0x0000100004042800 nid=0x1601d runnable

"GC task thread#21 (ParallelGC)" os_prio=0 tid=0x0000100004044800 nid=0x1601e runnable

"GC task thread#22 (ParallelGC)" os_prio=0 tid=0x0000100004046800 nid=0x1601f runnable

"VM Periodic Task Thread" os_prio=0 tid=0x0000100004111000 nid=0x16030 waiting on condition

JNI global references: 441

Heap
 PSYoungGen total 320512K, used 109661K [0x00000000eab00000, 0x0000000100000000, 0x0000000100000000)
  eden space 294400K, 35% used [0x00000000eab00000,0x00000000f1077a90,0x00000000fca80000)
  from space 26112K, 22% used [0x00000000fca80000,0x00000000fd01fd48,0x00000000fe400000)
  to space 24576K, 0% used [0x00000000fe800000,0x00000000fe800000,0x0000000100000000)
 ParOldGen total 699392K, used 86025K [0x00000000c0000000, 0x00000000eab00000, 0x00000000eab00000)
  object space 699392K, 12% used [0x00000000c0000000,0x00000000c54027d8,0x00000000eab00000)
 Metaspace used 42265K, capacity 42928K, committed 43136K, reserved 1085440K
  class space used 5945K, capacity 6154K, committed 6272K, reserved 1048576K
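
A dump this long is easier to triage with a short script. The following is a minimal Python sketch, not part of CaffeOnSpark: the names `summarize`, `SAMPLE`, and the `com.yahoo.ml.` prefix filter are assumptions chosen for illustration, and `SAMPLE` is an abbreviated excerpt of the dump above. It counts threads per state and reports, for each RUNNABLE thread, the topmost frame under the given package prefix.

```python
import re
from collections import Counter

# A thread entry starts with its quoted name at the beginning of a line.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"', re.MULTILINE)
STATE = re.compile(r'java\.lang\.Thread\.State: (\w+)')
# Captures the method part of each "at pkg.Class.method(...)" frame.
FRAME = re.compile(r'\bat ([\w.$<>]+)\(')


def summarize(dump, prefix="com.yahoo.ml."):
    """Return (state counts, {RUNNABLE thread name: first frame under prefix})."""
    headers = list(THREAD_HEADER.finditer(dump))
    states = Counter()
    busy = {}
    for i, header in enumerate(headers):
        # A thread's block runs from its header to the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(dump)
        block = dump[header.start():end]
        match = STATE.search(block)
        state = match.group(1) if match else "UNKNOWN"
        states[state] += 1
        if state == "RUNNABLE":
            for frame in FRAME.findall(block):
                if frame.startswith(prefix):
                    busy[header.group("name")] = frame
                    break
    return states, busy


# Abbreviated excerpt of the dump above (addresses and lower frames omitted).
SAMPLE = '''
"ForkJoinPool-1-worker-5" #128 daemon prio=5 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"ForkJoinPool-1-worker-4" #127 daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.jcaffe.CaffeNet.train(Native Method)

"ForkJoinPool-1-worker-3" #126 daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.caffe.CaffeProcessor.takeFromQueue(CaffeProcessor.scala:232)

"Executor task launch worker-0" #118 daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at com.yahoo.ml.caffe.CaffeProcessor.feedQueue(CaffeProcessor.scala:195)
'''

if __name__ == "__main__":
    states, busy = summarize(SAMPLE)
    print(dict(states))
    for name, frame in busy.items():
        print(name, "->", frame)
```

Run on the excerpt, this singles out the three CaffeOnSpark threads that are RUNNABLE in the dump: one inside the native `CaffeNet.train` call, one in `takeFromQueue`, and one in `feedQueue`. Because the script only relies on quoted thread names and `at ...(` frames, it should also work on a dump whose line breaks were lost, as long as each thread entry still starts on its own line.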

junshi15 commented 7 years ago

Try to disable validation by setting the following in your solver.prototxt file:

test_iter: 0
test_interval: 0
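
To confirm that a solver file actually carries these two settings before resubmitting, something like the following Python sketch could be used. It is an illustration only: `validation_settings` and `validation_disabled` are hypothetical helper names, and the code scans for plain `key: value` pairs with a regular expression instead of parsing the full protobuf text format.

```python
import re

# Matches "test_iter: <int>" or "test_interval: <int>" anywhere in a line.
FIELD = re.compile(r'\b(test_iter|test_interval)\s*:\s*(-?\d+)')


def validation_settings(solver_text):
    """Collect every test_iter / test_interval value found in the solver text."""
    values = {"test_iter": [], "test_interval": []}
    for line in solver_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        for key, value in FIELD.findall(line):
            values[key].append(int(value))
    return values


def validation_disabled(solver_text):
    """True only if both fields are present and every value is 0."""
    values = validation_settings(solver_text)
    if not values["test_iter"] or not values["test_interval"]:
        return False
    return all(v == 0 for v in values["test_iter"] + values["test_interval"])


if __name__ == "__main__":
    suggested = "test_iter: 0\ntest_interval: 0\n"
    print(validation_disabled(suggested))
```

The check is deliberately strict: a solver file that simply omits the fields is reported as not matching the suggestion, since the reply asks for both to be set to 0 explicitly.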