yu-iskw / polyaxon-tf-distributed-training


Distributed training doesn't work. #1

Closed yu-iskw closed 5 years ago

yu-iskw commented 5 years ago

The log looks fine, but the training appears to hang.

2019-01-23 23:01:44 UTC -- /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

2019-01-23 23:01:46 UTC -- WARNING:tensorflow:From mnist.py:25: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.

2019-01-23 23:01:46 UTC -- Instructions for updating:

2019-01-23 23:01:46 UTC -- Please use alternatives such as official/mnist/dataset.py from tensorflow/models.

2019-01-23 23:01:46 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.

2019-01-23 23:01:46 UTC -- Instructions for updating:

2019-01-23 23:01:46 UTC -- Please write your own downloading logic.

2019-01-23 23:01:46 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.

2019-01-23 23:01:46 UTC -- Instructions for updating:

2019-01-23 23:01:46 UTC -- Please use tf.data to implement this functionality.

2019-01-23 23:01:47 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.

2019-01-23 23:01:47 UTC -- Instructions for updating:

2019-01-23 23:01:47 UTC -- Please use tf.data to implement this functionality.

2019-01-23 23:01:47 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.

2019-01-23 23:01:47 UTC -- Instructions for updating:

2019-01-23 23:01:47 UTC -- Please use alternatives such as official/mnist/dataset.py from tensorflow/models.

2019-01-23 23:01:48 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py:378: multi_class_head (from tensorflow.contrib.learn.python.learn.estimators.head) is deprecated and will be removed in a future version.

2019-01-23 23:01:48 UTC -- Instructions for updating:

2019-01-23 23:01:48 UTC -- Please switch to tf.contrib.estimator.*_head.

2019-01-23 23:01:48 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:1179: __init__ (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.

2019-01-23 23:01:48 UTC -- Instructions for updating:

2019-01-23 23:01:48 UTC -- Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*

2019-01-23 23:01:48 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:427: __init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.

2019-01-23 23:01:48 UTC -- Instructions for updating:

2019-01-23 23:01:48 UTC -- When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.

2019-01-23 23:01:48 UTC -- INFO:tensorflow:Using model_dir in TF_CONFIG: /outputs/yu/tensorflow-mnist/experiments/2395

2019-01-23 23:01:48 UTC -- INFO:tensorflow:Using default config.

2019-01-23 23:01:48 UTC -- INFO:tensorflow:Using config: {'_model_dir': u'/outputs/yu/tensorflow-mnist/experiments/2395', '_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_session_config': None, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc3999d4150>, '_tf_config': gpu_options {

2019-01-23 23:01:48 UTC -- per_process_gpu_memory_fraction: 1.0

2019-01-23 23:01:48 UTC -- }

2019-01-23 23:01:48 UTC -- , '_num_worker_replicas': 3, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None, '_master': u'grpc://plxjob-master0-20067e1a35f342c5bd74381120aeadd7:2222', '_device_fn': None}

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From mnist.py:36: calling fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with y is deprecated and will be removed after 2016-12-01.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Estimator is decoupled from Scikit Learn interface by moving into

2019-01-23 23:01:49 UTC -- separate class SKCompat. Arguments x, y and batch_size are only

2019-01-23 23:01:49 UTC -- available in the SKCompat class, Estimator will only accept input_fn.

2019-01-23 23:01:49 UTC -- Example conversion:

2019-01-23 23:01:49 UTC -- est = Estimator(...) -> est = SKCompat(Estimator(...))

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From mnist.py:36: calling fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with x is deprecated and will be removed after 2016-12-01.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Estimator is decoupled from Scikit Learn interface by moving into

2019-01-23 23:01:49 UTC -- separate class SKCompat. Arguments x, y and batch_size are only

2019-01-23 23:01:49 UTC -- available in the SKCompat class, Estimator will only accept input_fn.

2019-01-23 23:01:49 UTC -- Example conversion:

2019-01-23 23:01:49 UTC -- est = Estimator(...) -> est = SKCompat(Estimator(...))

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From mnist.py:36: calling fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with batch_size is deprecated and will be removed after 2016-12-01.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Estimator is decoupled from Scikit Learn interface by moving into

2019-01-23 23:01:49 UTC -- separate class SKCompat. Arguments x, y and batch_size are only

2019-01-23 23:01:49 UTC -- available in the SKCompat class, Estimator will only accept input_fn.

2019-01-23 23:01:49 UTC -- Example conversion:

2019-01-23 23:01:49 UTC -- est = Estimator(...) -> est = SKCompat(Estimator(...))

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:508: __init__ (from tensorflow.contrib.learn.python.learn.estimators.estimator) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please switch to the Estimator interface.

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py:142: setup_train_data_feeder (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please use tensorflow/transform or tf.data.

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:100: extract_pandas_data (from tensorflow.contrib.learn.python.learn.learn_io.pandas_io) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please access pandas data directly.

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:102: extract_pandas_labels (from tensorflow.contrib.learn.python.learn.learn_io.pandas_io) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please access pandas data directly.

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:159: __init__ (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please use tensorflow/transform or tf.data.

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py:340: check_array (from tensorflow.contrib.learn.python.learn.learn_io.data_feeder) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- Please convert numpy dtypes explicitly.

2019-01-23 23:01:49 UTC -- DEBUG:tensorflow:Setting feature info to TensorSignature(dtype=tf.float32, shape=TensorShape([Dimension(None), Dimension(784)]), is_sparse=False).

2019-01-23 23:01:49 UTC -- DEBUG:tensorflow:Setting labels info to TensorSignature(dtype=tf.int32, shape=TensorShape([Dimension(None)]), is_sparse=False)

2019-01-23 23:01:49 UTC -- DEBUG:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=784, default_value=None, dtype=tf.float32, normalizer=None)

2019-01-23 23:01:49 UTC -- WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:678: __new__ (from tensorflow.contrib.learn.python.learn.estimators.model_fn) is deprecated and will be removed in a future version.

2019-01-23 23:01:49 UTC -- Instructions for updating:

2019-01-23 23:01:49 UTC -- When switching to tf.estimator.Estimator, use tf.estimator.EstimatorSpec. You can use the `estimator_spec` method to create an equivalent one.

2019-01-23 23:01:49 UTC -- INFO:tensorflow:Create CheckpointSaverHook.

2019-01-23 23:01:49 UTC -- INFO:tensorflow:Graph was finalized.
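
The log above shows the estimator reading its cluster layout from `TF_CONFIG` (task type `master`, task id 0). When a distributed run seems to stall, one way to sanity-check the setup is to print what each replica actually sees in that variable. The following is only a minimal sketch, assuming `TF_CONFIG` holds JSON with the usual `cluster` and `task` keys; the helper name `summarize_tf_config` and the sample host names are hypothetical and are not taken from this repository.

```python
import json
import os


def summarize_tf_config(raw=None):
    """Return (task_type, task_index, {job: replica_count}) from TF_CONFIG JSON.

    Hypothetical helper for illustration; it only parses the variable and
    does not contact any of the listed hosts.
    """
    if raw is None:
        # Fall back to the environment; an unset variable yields an empty config.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    counts = {job: len(hosts) for job, hosts in cluster.items()}
    return task.get("type"), task.get("index"), counts


# Made-up sample value; the host names are placeholders, not real pods.
sample = json.dumps({
    "cluster": {
        "master": ["master0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
        "ps": ["ps0:2222"],
    },
    "task": {"type": "master", "index": 0},
})

task_type, task_index, counts = summarize_tf_config(sample)
print(task_type, task_index, sorted(counts.items()))
```

Calling `summarize_tf_config()` with no argument inside each container reads the real environment, which makes it easy to compare the replica counts across the master, worker, and ps pods.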

harpone commented 5 years ago

Hi @yu-iskw! Have you gotten the distributed training to work yet? I'm planning to try PyTorch's DistributedDataParallel with Polyaxon, but I'm hesitating a bit because Polyaxon can be tricky even without any distributed stuff...

yu-iskw commented 5 years ago

@harpone The current code at master supports distributed training, and it is very straightforward. But I haven't tried PyTorch's distributed training.