Error while training the model

sreesindhu-sabbineni commented 4 years ago

I am trying to reproduce the results but I am getting the below error while training the model.

Traceback (most recent call last):
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot batch tensors with different shapes in component 1. First element had shape [321,321,1] and element 2 had shape [321,321,3].
         [[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[?,321,321,3], [?,321,321,1], [?,21], [?,41,41,21], [?]], output_types=[DT_FLOAT, DT_UINT8, DT_FLOAT, DT_FLOAT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DSRG.py", line 407, in <module>
    dsrg.train(base_lr=lr,weight_decay=5e-4,momentum=0.9,batch_size=batch_size,epoches=epoches)
  File "DSRG.py", line 377, in train
    self.sess.run(self.net["accum_gradient_accum"])
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot batch tensors with different shapes in component 1. First element had shape [321,321,1] and element 2 had shape [321,321,3].
         [[node IteratorGetNext (defined at DSRG.py:347)  = IteratorGetNext[output_shapes=[[?,321,321,3], [?,321,321,1], [?,21], [?,41,41,21], [?]], output_types=[DT_FLOAT, DT_UINT8, DT_FLOAT, DT_FLOAT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

Caused by op 'IteratorGetNext', defined at:
  File "DSRG.py", line 407, in <module>
    dsrg.train(base_lr=lr,weight_decay=5e-4,momentum=0.9,batch_size=batch_size,epoches=epoches)
  File "DSRG.py", line 347, in train
    data_x,data_y,data_tag,data_cues,id_of_image,iterator_train = self.data.next_batch(category="train",batch_size=batch_size,epoches=-1)
  File "/home/ssindhu/winter2020/DSRG-tensorflow/pythonlib/dataset_DSRG.py", line 109, in next_batch
    img,gt,tag,cues,id_ = iterator.get_next()
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
    name=name)), self._output_types,
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ssindhu/deeplab_env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Cannot batch tensors with different shapes in component 1. First element had shape [321,321,1] and element 2 had shape [321,321,3].
         [[node IteratorGetNext (defined at DSRG.py:347)  = IteratorGetNext[output_shapes=[[?,321,321,3], [?,321,321,1], [?,21], [?,41,41,21], [?]], output_types=[DT_FLOAT, DT_UINT8, DT_FLOAT, DT_FLOAT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

xtudbxk commented 4 years ago

It seems there are some grayscale images in the official dataset. You can modify the code to handle this case. I may also push some modifications to this project after I validate this guess in a few hours.

sreesindhu-sabbineni commented 4 years ago

Thanks for the quick reply. However, my understanding from the error message is that the error is caused by the difference in dimensions between the ground truth label (segmentation ground truth) and the original JPEG image, which should not be a problem. Please correct me if I am wrong. Also, if there are grayscale images, can we simply skip them by checking the dimensions of the input image?
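A dimension check of the kind asked about here could look roughly like the sketch below. It is not code from this repository. It assumes the shape of each decoded image or label has already been collected into a dict (for example via `numpy.asarray(PIL.Image.open(path)).shape`); the sample ids, the shapes, and the helper names `channel_count` and `find_unexpected_channels` are hypothetical.

```python
def channel_count(shape):
    """Return the channel count for an (H, W) or (H, W, C) shape."""
    # A 2-D shape has no explicit channel axis, so treat it as one channel.
    return 1 if len(shape) == 2 else shape[-1]


def find_unexpected_channels(shapes, expected):
    """Return the sorted ids whose channel count differs from `expected`."""
    return sorted(
        sample_id
        for sample_id, shape in shapes.items()
        if channel_count(shape) != expected
    )


# Hypothetical ids and shapes, for illustration only.
label_shapes = {
    "sample_a": (321, 321, 1),
    "sample_b": (321, 321, 3),
    "sample_c": (321, 321),
}

print(find_unexpected_channels(label_shapes, expected=1))  # ['sample_b']
```

Running the same check with `expected=3` over the JPEG image shapes would list any grayscale inputs, which could then be skipped or converted before training.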

sreesindhu-sabbineni commented 4 years ago

I have the same issue while training SEC-tensorflow as well. Do you think the problem lies in the JPEG Images dataset of PASCAL VOC?

Also, I think epoch 0 is running fine as I get the below output before the error.

start_time: 1579081217.421928
config -- lr:0.001000 weight_decay:0.000500 momentum:0.900000 batch_size:4.000000 epoches:32.000000
epoch:0.000000, iteration:0.000000, lr:0.001000, loss:13.516954 seed_loss:12.225690,constrain_loss:0.522553

xtudbxk commented 4 years ago

After checking the images and labels in the JPEGImages and SegmentationClassAug folders respectively, I agree with you that the error arises from the dimensions of the ground truth labels. I have fixed this bug and retested, and it now seems to work on my local machine. I also pushed the new commit to this project a few minutes ago. I hope the newest code solves your problem.

sreesindhu-sabbineni commented 4 years ago

Thank you. I am training the model now and will let you know once it is complete. By the way, I was able to start training successfully yesterday with the previous version by setting the batch size to 1.
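A likely reason a batch size of 1 avoids the error is that the message complains about elements of the same batch having different shapes, and a batch with a single element has nothing to disagree with. The pure-Python sketch below only imitates that check. It is not TensorFlow's implementation, and the helper `check_batches` and the shape list are hypothetical.

```python
def check_batches(shapes, batch_size):
    """Split `shapes` into consecutive batches and return the batch count.

    Raises ValueError if any batch mixes different shapes, loosely
    imitating the "Cannot batch tensors with different shapes" failure.
    """
    num_batches = 0
    for start in range(0, len(shapes), batch_size):
        batch = shapes[start:start + batch_size]
        if any(shape != batch[0] for shape in batch):
            raise ValueError(
                "Cannot batch elements with different shapes: %s" % (batch,)
            )
        num_batches += 1
    return num_batches


# Hypothetical label shapes: the second label has three channels.
label_shapes = [(321, 321, 1), (321, 321, 3), (321, 321, 1), (321, 321, 1)]

print(check_batches(label_shapes, batch_size=1))  # 4, every batch is uniform

try:
    check_batches(label_shapes, batch_size=4)
except ValueError as err:
    print(err)  # the single batch of four mixes 1- and 3-channel labels
```

With a batch size of 1 the odd label still reaches the network, so this only hides the symptom. Making every label single-channel is the more robust route.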

sreesindhu-sabbineni commented 4 years ago

The model trained successfully and the final loss came to 1.12. Also, would you mind adding the process for evaluating the model in the evaluation project?

xtudbxk / DSRG-tensorflow

Error while training the model #22