yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

CaffeOnSpark with mesos ? #212

Closed: davvdg closed this issue 7 years ago

davvdg commented 7 years ago

Hi

Here, we're trying to use CaffeOnSpark with Mesos as the scheduler rather than YARN. CaffeOnSpark seems to be really focused on YARN, and we're not sure how to specify the resource allocation when submitting our test job.

Since we like to experiment, here are some additional constraints:

1st thing is: Spark on Mesos is dynamic and will try to grab as many cores and as many executors as possible. We may not know the allocation in advance, so how can we set the "-devices" and "-clusterSize" parameters?
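One way to make the allocation predictable is to cap what Spark may take from Mesos so the executor count is known before the job starts, and pass a matching -clusterSize. A sketch only: the property names are standard Spark, but the master URL, paths, and values below are placeholders, not taken from this thread.

```shell
# Hypothetical submission sketch: cap Spark's Mesos allocation so the executor
# count is deterministic, then pass a matching -clusterSize.
# Master URL, jar path, and solver file are placeholders.
spark-submit --master mesos://zk://zk1:2181/mesos \
  --conf spark.cores.max=16 \
  --conf spark.executor.cores=4 \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train \
  -conf cifar10_quick_solver.prototxt \
  -clusterSize 4 \
  -devices 1
```

With spark.cores.max=16 and 4 cores per executor, Spark can launch at most 16 / 4 = 4 executors, so -clusterSize 4 is no longer a guess.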

2nd thing is: we've tried to constrain the geometry of the subset of machines allotted to one CaffeOnSpark job using the following conf options:
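(The original options did not survive in this thread. Purely as a hypothetical illustration, Spark-on-Mesos geometry is usually pinned with properties like the following; the property names are standard Spark, the values are made up.)

```shell
# Illustrative only -- the issue's actual conf options were not captured.
--conf spark.cores.max=40                  # hard cap on total cores for the job
--conf spark.executor.cores=4              # cores per executor => at most 10 executors
--conf spark.executor.memory=8g            # memory per executor
--conf spark.mesos.constraints="rack:r1"   # only accept offers from matching agents
```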

In one specific setup (all Docker images ready to start at job launch, before the tasks are sent to the executors), job 0 starts and every machine gets a piece of work. At the end of the first stage, though, I hit the following error:

INFO CaffeOnSpark: total_records_train: 50000
INFO CaffeOnSpark: no_of_records_required_per_partition_train: 25600
Exception in thread "main" java.lang.IllegalStateException: Insufficient training data. Please adjust hyperparameters or increase dataset.
    at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:261)
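For what it's worth, the two log lines above already hint at the mismatch. A back-of-the-envelope check (the exact formula is internal to CaffeOnSpark, and the partition count of 2 below is hypothetical):

```shell
# Numbers taken from the log above; the partition count is hypothetical.
total_records=50000
required_per_partition=25600
partitions=2
needed=$((required_per_partition * partitions))
echo "need $needed, have $total_records"   # need 51200, have 50000
```

As soon as there are two or more partitions, 2 x 25600 = 51200 > 50000, so the check in CaffeOnSpark.trainWithValidation fails regardless of the machine count, which is consistent with the error appearing both below and above 10 machines.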

I've been able to reproduce this error with fewer than 10 machines and with more than 10, without changing the -clusterSize option. I've run into other funny things as well, but I'll open dedicated issues for those.

Anyway, any idea what could cause this error?

davvdg commented 7 years ago

All right... I found out how to get the CIFAR test running. It was just a problem with the layer description of the dataset (batch size, image size, and channels). I'll be back with more questions...
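For readers hitting the same thing: the poster's actual prototxt is not in the thread, but the kind of fix described is making the data layer's geometry match the dataset. A hypothetical sketch, loosely modeled on the MemoryData layers used in CaffeOnSpark's bundled examples (field names beyond the standard batch_size / channels / height / width are CaffeOnSpark extensions; check the examples shipped with the project for exact placement):

```protobuf
# Hypothetical CIFAR-10 data layer sketch; path and values are placeholders.
# CIFAR-10 images are 3 channels, 32 x 32 -- these must match the dataset.
layer {
  name: "cifar"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    source: "file:/tmp/cifar10_train_lmdb"   # placeholder path
    batch_size: 100
    channels: 3
    height: 32
    width: 32
    share_in_parallel: false
  }
}
```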