yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

CaffeOnSpark with mesos ? #212

Closed: davvdg closed this issue 7 years ago

davvdg commented 7 years ago

Hi

Here, we're trying to use CaffeOnSpark with Mesos as the scheduler rather than YARN. CaffeOnSpark seems to be really focused on YARN, and we're not sure how to specify the resource allocation when submitting our test job.

Since we like to experiment, here are some additional constraints:

1st thing is: Spark on Mesos is dynamic and will try to grab as many cores and as many executors as possible. We may not know the allocation in advance, so how can we set the "-devices" and "-clusterSize" parameters?
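One way to make the allocation predictable is to cap what Spark may take from Mesos so the executor count is known before the job starts, and pass a matching -clusterSize. A sketch only: the property names are standard Spark, but the master URL, paths, and values below are placeholders, not taken from this thread.

```shell
# Hypothetical submission sketch: cap Spark's Mesos allocation so the executor
# count is deterministic, then pass a matching -clusterSize.
# Master URL, jar path, and solver file are placeholders.
spark-submit --master mesos://zk://zk1:2181/mesos \
  --conf spark.cores.max=16 \
  --conf spark.executor.cores=4 \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train \
  -conf cifar10_quick_solver.prototxt \
  -clusterSize 4 \
  -devices 1
```

With spark.cores.max=16 and 4 cores per executor, Spark can launch at most 16 / 4 = 4 executors, so -clusterSize 4 is no longer a guess.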

2nd thing is: we've tried to constrain the geometry of the subset of machines allotted to one CaffeOnSpark job using the following conf options:
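(The original options did not survive in this thread. Purely as a hypothetical illustration, Spark-on-Mesos geometry is usually pinned with properties like the following; the property names are standard Spark, the values are made up.)

```shell
# Illustrative only -- the issue's actual conf options were not captured.
--conf spark.cores.max=40                  # hard cap on total cores for the job
--conf spark.executor.cores=4              # cores per executor => at most 10 executors
--conf spark.executor.memory=8g            # memory per executor
--conf spark.mesos.constraints="rack:r1"   # only accept offers from matching agents
```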

In one specific setup (all Docker images ready to start at job launch, before the tasks are sent to the executors), job 0 starts and every machine gets a piece of work. At the end of the first stage, though, I hit the following error:

INFO CaffeOnSpark: total_records_train: 50000
INFO CaffeOnSpark: no_of_records_required_per_partition_train: 25600
Exception in thread "main" java.lang.IllegalStateException: Insufficient training data. Please adjust hyperparameters or increase dataset.
    at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:261)
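For what it's worth, the two log lines above already hint at the mismatch. A back-of-the-envelope check (the exact formula is internal to CaffeOnSpark, and the partition count of 2 below is hypothetical):

```shell
# Numbers taken from the log above; the partition count is hypothetical.
total_records=50000
required_per_partition=25600
partitions=2
needed=$((required_per_partition * partitions))
echo "need $needed, have $total_records"   # need 51200, have 50000
```

As soon as there are two or more partitions, 2 x 25600 = 51200 > 50000, so the check in CaffeOnSpark.trainWithValidation fails regardless of the machine count, which is consistent with the error appearing both below and above 10 machines.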

I've been able to reproduce this error with fewer than 10 machines and with more than 10, without changing the -clusterSize option. I've run into other funny things as well, but I'll open dedicated issues for those.

Anyway, any idea what could cause this error?

davvdg commented 7 years ago

All right... I found out how to get the CIFAR test running. It was just a problem with the layer description of the dataset (batch size, image size, and channels). I'll be back with more questions...
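For readers hitting the same thing: the poster's actual prototxt is not in the thread, but the kind of fix described is making the data layer's geometry match the dataset. A hypothetical sketch, loosely modeled on the MemoryData layers used in CaffeOnSpark's bundled examples (field names beyond the standard batch_size / channels / height / width are CaffeOnSpark extensions; check the examples shipped with the project for exact placement):

```protobuf
# Hypothetical CIFAR-10 data layer sketch; path and values are placeholders.
# CIFAR-10 images are 3 channels, 32 x 32 -- these must match the dataset.
layer {
  name: "cifar"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    source: "file:/tmp/cifar10_train_lmdb"   # placeholder path
    batch_size: 100
    channels: 3
    height: 32
    width: 32
    share_in_parallel: false
  }
}
```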