yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

Training stops after saving a model #279

Closed Zzmc closed 6 years ago

Zzmc commented 7 years ago

Hi, I have 3 nodes. When I train a model, I save it every 5,000 iterations, but when it saves the model, training stops. There is no error message. What should I do? What could be wrong?

junshi15 commented 7 years ago

What is your max number of iterations? Has it been reached?

Zzmc commented 7 years ago

@junshi15 No, it has not. I have worked around the problem: it seems to be caused by the solver parameter `test_interval`. When I changed that number, training ran again, but it still stops after a while. Then I looked at the source code:

```scala
val no_of_records_required_per_partition_train =
  conf.solverParameter.getTestInterval() * sourceTrain.batchSize() * conf.devices
val total_records_train = trainDataRDD.count()
log.info("total_records_train: " + total_records_train)
log.info("no_of_records_required_per_partition_train: " + no_of_records_required_per_partition_train)
if (total_records_train < no_of_records_required_per_partition_train * conf.clusterSize) {
  throw new IllegalStateException("Insufficient training data. Please adjust hyperparameters or increase dataset.")
}
val no_of_partitions_train =
  (total_records_train / no_of_records_required_per_partition_train).toInt
log.info("num of training partitions: " + no_of_partitions_train)
```

Do you think this is the key to my problem? Could you give me some suggestions? Thank you.
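To see why `test_interval` matters here, the quoted check can be sketched numerically. The values below (test interval 5,000, batch size 64, 1 device per executor, 3 executors) are hypothetical placeholders, not taken from the issue:

```python
# Sketch of the training-data sufficiency check quoted above.
# All concrete numbers are hypothetical; in CaffeOnSpark they come
# from solver.prototxt and the Spark/cluster configuration.

test_interval = 5000   # solver parameter "test_interval"
batch_size = 64        # training data source batch size
devices = 1            # devices per executor
cluster_size = 3       # number of executors (e.g. the 3 nodes above)

records_required_per_partition = test_interval * batch_size * devices

def check_training_data(total_records: int) -> int:
    """Mimic the IllegalStateException check from the quoted source;
    returns the number of training partitions on success."""
    if total_records < records_required_per_partition * cluster_size:
        raise ValueError("Insufficient training data. "
                         "Please adjust hyperparameters or increase dataset.")
    return total_records // records_required_per_partition

# With these numbers, at least 320,000 * 3 = 960,000 records are needed,
# so a large test_interval can easily make the dataset "insufficient".
print(records_required_per_partition * cluster_size)  # 960000
```

This shows the requirement scales linearly with `test_interval`, which is consistent with the symptom that changing it affects whether training proceeds.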

junshi15 commented 7 years ago

Can you try disabling the test phase by setting the following in your solver.prototxt file?

```
test_iter: 0
test_interval: 0
```
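For context, these lines sit alongside the other solver settings. A minimal sketch of such a solver.prototxt, where the net path, learning rate, iteration counts, and snapshot prefix are placeholders rather than values from this issue:

```
net: "train_val.prototxt"   # placeholder path to the network definition
test_iter: 0                # disable the test phase
test_interval: 0
base_lr: 0.01               # placeholder learning rate
max_iter: 50000             # placeholder iteration budget
snapshot: 5000              # save the model every 5,000 iterations, as above
snapshot_prefix: "model"    # placeholder snapshot prefix
```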

Zzmc commented 6 years ago

@junshi15 Thanks, it is working!