zsyzzsoft / co-mod-gan

[ICLR 2021, Spotlight] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

The training process was killed #20

Open yezhanglang opened 3 years ago

yezhanglang commented 3 years ago

Hi,

Thank you for your great work. However, when I use co-mod-gan to train on my own dataset, the process is killed after the first loop. My machine has 1 V100 GPU (28 GB of GPU memory) and 4 CPUs (20 GB of CPU memory each). I set the batch size to 1: sched.minibatch_size_base = 1, sched.minibatch_gpu_base = 1


Also, when I used dataset_tools/create_from_images.py with --shuffle to convert the raw images into TFRecords, it stopped while processing the last image, but I used the resulting TFRecords for training anyway. I created the data with the following command:

python /ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dataset_tools/create_from_images.py --train-image-dir="/GlobalData/ums/image_inpaint/co-mod-gan/train_data" --val-image-dir="/GlobalData/ums/image_inpaint/co-mod-gan/test_data" --tfrecord-dir="/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/tfrecord/food" --shuffle


Why was the process killed? Is something wrong with my dataset, or is my GPU/CPU memory insufficient?

Best regards

yezhanglang commented 3 years ago

I switched to a 1090 machine with 4 GPUs (9 GB of GPU memory each) and 8 CPUs (80 GB of CPU memory each), and training now starts. However, it fails after several steps.

Here is the error information:

truncation=None
100%|#############################################################################################################################################################################| 313/313 [02:05<00:00, 2.59it/s]
network-snapshot-050060 time 4m 45s ids10k-FID 14.2303 ids10k-U 0.1587 ids10k-P 0.0406
tick 11 kimg 50066.0 lod 0.00 minibatch 4 time 2h 22m 29s sec/tick 707.2 sec/kimg 117.87 maintenance 330.4 gpumem 8.1

tick 12 kimg 50072.0 lod 0.00 minibatch 4 time 2h 34m 17s sec/tick 707.4 sec/kimg 117.90 maintenance 0.0 gpumem 8.1
tick 13 kimg 50078.0 lod 0.00 minibatch 4 time 2h 46m 04s sec/tick 707.2 sec/kimg 117.87 maintenance 0.0 gpumem 8.1
tick 14 kimg 50084.0 lod 0.00 minibatch 4 time 2h 57m 51s sec/tick 707.4 sec/kimg 117.89 maintenance 0.0 gpumem 8.1
tick 15 kimg 50090.0 lod 0.00 minibatch 4 time 3h 09m 39s sec/tick 707.8 sec/kimg 117.97 maintenance 0.0 gpumem 8.1
tick 16 kimg 50096.0 lod 0.00 minibatch 4 time 3h 21m 27s sec/tick 708.1 sec/kimg 118.01 maintenance 0.0 gpumem 8.1
tick 17 kimg 50102.0 lod 0.00 minibatch 4 time 3h 33m 16s sec/tick 709.1 sec/kimg 118.18 maintenance 0.0 gpumem 8.1
Traceback (most recent call last):
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: 2 root error(s) found.
  (0) Data loss: truncated record at 178811444399
    [[{{node GPU2/DataFetch/IteratorGetNext}}]]
    [[GPU1/DataFetch/IteratorGetNext/_12103]]
  (1) Data loss: truncated record at 178811444399
    [[{{node GPU2/DataFetch/IteratorGetNext}}]]
0 successful operations.
3 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 133, in <module>
    main()
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 128, in main
    run(**vars(args))
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 71, in run
    dnnlib.submit_run(**kwargs)
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/training_loop.py", line 307, in training_loop
    tflib.run(data_fetch_op, feed_dict)
  File "/ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/tflib/tfutil.py", line 31, in run
    return tf.get_default_session().run(*args, **kwargs)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: 2 root error(s) found.
  (0) Data loss: truncated record at 178811444399
    [[node GPU2/DataFetch/IteratorGetNext (defined at ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/dataset.py:171) ]]
    [[GPU1/DataFetch/IteratorGetNext/_12103]]
  (1) Data loss: truncated record at 178811444399
    [[node GPU2/DataFetch/IteratorGetNext (defined at ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/dataset.py:171) ]]
0 successful operations.
3 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node GPU2/DataFetch/IteratorGetNext:
  Dataset/IteratorV2 (defined at ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/dataset.py:145)

Input Source operations connected to node GPU2/DataFetch/IteratorGetNext:
  Dataset/IteratorV2 (defined at ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/dataset.py:145)

Original stack trace for 'GPU2/DataFetch/IteratorGetNext':
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 133, in <module>
    main()
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 128, in main
    run(**vars(args))
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/run_training.py", line 71, in run
    dnnlib.submit_run(**kwargs)
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/training_loop.py", line 215, in training_loop
    reals_write, labels_write = training_set.get_minibatch_tf()
  File "ProjectRoot/python_workspace/image_inpaint/co-mod-gan/training/dataset.py", line 171, in get_minibatch_tf
    return self._tf_iterator.get_next()
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 426, in get_next
    output_shapes=self._structure._flat_shapes, name=name)
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1947, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "UserData/software/anaconda3/envs/python3.6_lowlight_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

zsyzzsoft commented 3 years ago

The dataset seems broken. Please try creating the dataset again.
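
A "truncated record" error usually means the TFRecord file ends in the middle of a record, which would fit the conversion script having stopped on the last image. One way to confirm this before regenerating is to walk the file's record framing. The sketch below is only an illustration and is not part of this repository: `scan_tfrecord` is a hypothetical helper, and it relies on the standard TFRecord layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It checks lengths only and does not verify the CRCs.

```python
import os
import struct


def scan_tfrecord(path):
    """Walk the record framing of a TFRecord file without TensorFlow.

    Returns (complete_records, truncated_at). truncated_at is the byte
    offset of the first incomplete record, or None if the file ends
    exactly on a record boundary. CRCs are not verified.
    """
    size = os.path.getsize(path)
    count = 0
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            # Header: uint64 payload length + uint32 CRC of that length.
            header = f.read(12)
            if len(header) < 12:
                return count, offset
            (length,) = struct.unpack("<Q", header[:8])
            # Payload is followed by a uint32 CRC of the payload.
            end = offset + 12 + length + 4
            if end > size:
                return count, offset
            count += 1
            offset = end
    return count, None
```

Calling `scan_tfrecord` on one of the generated `.tfrecords` files (the exact file names depend on what `create_from_images.py` wrote) returns a non-`None` offset when the file is cut short. Because it only seeks from header to header, it stays fast even on very large files.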