yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

ImageNet training job fails since there is not enough virtual memory #255

Closed: DwyaneShi closed this issue 7 years ago

DwyaneShi commented 7 years ago

I'm now evaluating CaffeOnSpark with CIFAR-10, MNIST and AlexNet models. Jobs with CIFAR-10 and MNIST work well, but the job with AlexNet fails since there is not enough virtual memory to hold the entire ImageNet training dataset (ILSVRC2012).

Here are some details of each GPU node:

  1. 125GB memory
  2. 108GB HDD

The job has 16 executors with 2 GPUs/executor, and the source class in the train_val.prototxt is SeqImageDataSource (sequence files are generated by com.yahoo.ml.caffe.tools.Binary2Sequence).
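For context, the data layer that references SeqImageDataSource looks roughly like the sketch below. This is a simplified placeholder, not my exact configuration: the path, batch size and image dimensions are illustrative, and the CaffeOnSpark-specific fields (source_class and the memory_data_param extensions) follow the example prototxt files bundled with the repo and may differ between versions.

```
# Sketch of a CaffeOnSpark data layer backed by SeqImageDataSource
# (placeholder path and sizes; field layout follows the bundled example prototxt files)
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  source_class: "com.yahoo.ml.caffe.SeqImageDataSource"
  memory_data_param {
    source: "hdfs:///path/to/ilsvrc2012_train_seq"   # sequence file from Binary2Sequence
    batch_size: 64
    channels: 3
    height: 227
    width: 227
  }
  transform_param {
    mean_file: "imagenet_mean.binaryproto"
  }
  include { phase: TRAIN }
}
```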

Here is the command:

```
spark-submit --master yarn --deploy-mode cluster \
    --num-executors 16 --executor-memory 90g \
    --files ${HOME}/CaffeOnSpark/data/alexnet_solver.prototxt,${HOME}/CaffeOnSpark/data/alexnet_train_val.prototxt,${HOME}/CaffeOnSpark/data/imagenet_mean.binaryproto \
    --conf spark.driver.extraLibraryPath=XXX \
    --conf spark.executorEnv.LD_LIBRARY_PATH=XXX \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${HOME}/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -clusterSize 16 -train -features accuracy,loss -label label \
    -conf alexnet_solver.prototxt -devices 2 -connection ethernet \
    -model /projects/machine_learning/output/alexnet.model \
    -output /projects/machine_learning/output/alexnet_features_result
```

Finally, YARN complains:

```
2017-05-13 22:52:01,504 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 28800 for container-id container_1494729922386_0001_01_000090: 1.9 GB of 99 GB physical memory used; 255.6 GB of 396 GB virtual memory used
```
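The 396 GB virtual-memory limit in that line is the container's physical allocation (99 GB) multiplied by YARN's vmem-to-pmem ratio (4 here; the Hadoop default is 2.1). If the NodeManager's virtual-memory check is what ends up killing containers, these are the yarn-site.xml properties that govern it; the values below are illustrative, not necessarily our cluster's actual settings:

```
<!-- yarn-site.xml: properties controlling the NodeManager's container memory checks -->
<!-- (illustrative values; verify against the actual cluster configuration) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>    <!-- 99 GB physical * 4 = 396 GB virtual limit, matching the log -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value> <!-- set to false to disable the virtual-memory check entirely -->
</property>
```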

Is there any way to run CaffeOnSpark successfully with such a big dataset like ImageNet in our GPU nodes?

junshi15 commented 7 years ago

It should work. We have successfully trained the Inception network on the ImageNet dataset. I do not see a memory violation in the line you posted, especially since only 1.9 GB of physical memory was used. As for the dataset format, we initially used sequence files and later switched to data frames; both worked.

DwyaneShi commented 7 years ago

More YARN logs:

```
2017-05-13 22:53:28,241 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1494729922386_0001_02_000016 and exit code: 134
ExitCodeException exitCode=134: /bin/bash: line 1: 28986 Aborted (core dumped) ...
```
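For reference, exit code 134 is 128 + 6, i.e. the native process received SIGABRT (an abort, usually with a core dump) rather than being killed externally; the mapping can be checked from a shell:

```
# 134 - 128 = 6, and signal 6 is SIGABRT (abort, typically producing a core dump)
kill -l $((134 - 128))    # prints: ABRT
```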

The container exits with code 134 because of a "core dumped" error. And as shown in the previous log ("1.9 GB of 99 GB physical memory used; 255.6 GB of 396 GB virtual memory used"), the virtual memory currently consumed (255.6 GB) exceeds memory (125 GB) + disk (108 GB) on our GPU node, so I think this is why it fails. I tried the same configuration multiple times, and every run failed when virtual memory usage reached about 250 GB.