Closed Zzmc closed 6 years ago
What's your max number of iterations? Has it been reached?
@junshi15
No, it has not. I have solved this problem: it looks like it was due to the solver parameter "test_interval". When I changed this number, training could run again, but it stopped after a while. Then I looked at the source code:
val no_of_records_required_per_partition_train = conf.solverParameter.getTestInterval() * sourceTrain.batchSize() * conf.devices
val total_records_train = trainDataRDD.count()
log.info("total_records_train: " + total_records_train)
log.info("no_of_records_required_per_partition_train: " + no_of_records_required_per_partition_train)
if (total_records_train < no_of_records_required_per_partition_train * conf.clusterSize) {
  throw new IllegalStateException("Insufficient training data. Please adjust hyperparameters or increase dataset.")
}
val no_of_partitions_train = (total_records_train/no_of_records_required_per_partition_train).toInt
log.info("num of training partitions: " + no_of_partitions_train)
Do you think this check is the key to my problem?
Could you give me some suggestions? Thank you.
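To make the arithmetic in the quoted check concrete, here is a minimal standalone sketch. The object name, helper names, and all numbers (a test_interval of 500 or 100, batch size 64, 1 device, 3 nodes, 60,000 records) are hypothetical and chosen only for illustration. It reproduces only the comparison from the excerpt, not CaffeOnSpark itself, and this thread does not confirm whether that check is what causes the stall.

```scala
// Standalone sketch of the data-size check from the excerpt (hypothetical names and numbers).
object PartitionCheck {
  // Records one partition needs: test_interval * batch size * devices, as in the excerpt.
  def requiredPerPartition(testInterval: Int, batchSize: Int, devices: Int): Long =
    testInterval.toLong * batchSize * devices

  // True when the dataset passes the check; the excerpt throws IllegalStateException otherwise.
  def hasEnoughData(totalRecords: Long, perPartition: Long, clusterSize: Int): Boolean =
    totalRecords >= perPartition * clusterSize

  def main(args: Array[String]): Unit = {
    val large = requiredPerPartition(testInterval = 500, batchSize = 64, devices = 1)
    println(large)                                        // 32000
    println(hasEnoughData(60000L, large, clusterSize = 3)) // false: 60000 < 96000

    val small = requiredPerPartition(testInterval = 100, batchSize = 64, devices = 1)
    println(hasEnoughData(60000L, small, clusterSize = 3)) // true: 60000 >= 19200
  }
}
```

Under these assumed numbers, lowering test_interval shrinks the number of records each partition must hold, which is consistent with the observation that changing the parameter affected whether training could start.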
Can you try to disable the test phase by doing the following in your solver.prototxt file?
test_iter: 0
test_interval: 0
@junshi15 Thanks, it is working!
Hi, I have 3 nodes. When I train a model, I save the model every 5,000 iterations, but when it saves the model, training stops. There is no error message. What is wrong, and what should I do?