yangxue0827 / R2CNN_FPN_Tensorflow

R2CNN: Rotational Region CNN Based on FPN (Tensorflow)

OutOfRangeError in /data/io/read_tfrecord.py at line number 80 #5

Open shaileshvedula opened 6 years ago

shaileshvedula commented 6 years ago

Hi,

I get the following error when trying to train the model using train1.py on my custom dataset. I am using resnet-101 as the backbone. Can you please help me out here?

Traceback (most recent call last):
  File "train1.py", line 262, in <module>
    train()
  File "train1.py", line 224, in train
    fast_rcnn_total_loss, total_loss, train_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]

Caused by op u'get_batch/batch', defined at:
  File "train1.py", line 262, in <module>
    train()
  File "train1.py", line 36, in train
    is_training=True)
  File "../data/io/read_tfrecord.py", line 86, in next_batch
    dynamic_pad=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 922, in batch
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 716, in _batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 457, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1342, in _queue_dequeue_many_v2
    timeout_ms=timeout_ms, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]

lyz0305 commented 6 years ago

I met the same problem. I think it's because the training process is not reusing the data: when the training step exceeds the number of training samples, this error happens. I think you can rewrite the data-input part of the code to fix it; a sketch of the idea is below.
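
For reference, here is a minimal sketch of a TF 1.x queue-based input pipeline, assuming the usual string_input_producer / TFRecordReader setup; this is illustrative, not the repo's actual read_tfrecord.py, and the feature keys and JPEG decoding are assumptions. With num_epochs=None the filename queue cycles the data forever, so the batch queue is never closed; with a finite num_epochs the queue closes once the data is exhausted and the dequeue raises exactly this OutOfRangeError.

    # Minimal sketch (TF 1.x), not the repo's code; feature keys/decoding are assumed.
    import tensorflow as tf

    def next_batch_sketch(tfrecord_pattern, batch_size=1):
        # num_epochs=None cycles the filenames indefinitely, so the downstream
        # batch queue is never closed and training can run for any number of steps.
        filename_queue = tf.train.string_input_producer(
            tf.train.match_filenames_once(tfrecord_pattern), num_epochs=None)
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)
        features = tf.parse_single_example(
            serialized_example,
            features={'img_name': tf.FixedLenFeature([], tf.string),
                      'img': tf.FixedLenFeature([], tf.string)})
        # Assumes the image bytes were stored JPEG-encoded.
        img = tf.image.decode_jpeg(features['img'], channels=3)
        return tf.train.batch([features['img_name'], img],
                              batch_size=batch_size,
                              capacity=32,
                              dynamic_pad=True)

(The queue runners still have to be started with tf.train.start_queue_runners inside the session, as train1.py already does.)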

yangxue0827 commented 6 years ago

In fact, there is no problem with the code; it may be caused by an environment configuration error. I have modified the code, please update.

shaileshvedula commented 6 years ago

It still produces the same error. Can you tell me what change you made?

lyz0305 commented 6 years ago

@1991viet The author yangxue0827 has already modified the code, so I think the problem should be solved. You can look at the details of the code changes from around Jan 30.

hinkeret commented 6 years ago

I have the same problem. Has anyone solved it? Thanks.

SandeepSreenivasan commented 6 years ago

I am getting the same error. I have checked the tfrecord path; it seems to be correct, and the tfrecord creation didn't give any problems. Is there any solution for this?

heping0228 commented 6 years ago

Has anyone solved this problem? I am getting it as well...

Elsanna commented 5 years ago

I am getting the same error too...

OneSilverBullet commented 5 years ago

The same problem...

lemonaha commented 5 years ago

2018-12-13 07:07:10.108873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:05:00.0, compute capability: 6.1)
restore model
2018-12-13 07:10:08.417504: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Input to reshape is a tensor with 1 values, but the requested shape requires a multiple of 9
  [[Node: get_batch/Reshape_1 = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](get_batch/DecodeRaw_1/_1061, get_batch/Reshape_1/shape)]]
2018-12-13 07:10:08.417533: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Input to reshape is a tensor with 1 values, but the requested shape requires a multiple of 9
  [[Node: get_batch/Reshape_1 = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](get_batch/DecodeRaw_1/_1061, get_batch/Reshape_1/shape)]]
2018-12-13 07:10:08.417514: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Input to reshape is a tensor with 1 values, but the requested shape requires a multiple of 9
  [[Node: get_batch/Reshape_1 = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](get_batch/DecodeRaw_1/_1061, get_batch/Reshape_1/shape)]]
2018-12-13 07:10:08.418403: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Input to reshape is a tensor with 1 values, but the requested shape requires a multiple of 9
  [[Node: get_batch/Reshape_1 = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](get_batch/DecodeRaw_1/_1061, get_batch/Reshape_1/shape)]]
2018-12-13 07:10:58: step1 image_name:1534151804311.jpg | rpn_loc_loss:0.244331941009 | rpn_cla_loss:1.68899667263 | rpn_total_loss:1.93332862854 | fast_rcnn_loc_loss:0.0440079607069 | fast_rcnn_cla_loss:1.43304491043 | fast_rcnn_loc_rotate_loss:0.203424081206 | fast_rcnn_cla_rotate_loss:1.65664339066 | fast_rcnn_total_loss:3.33712053299 | total_loss:6.10644292831 | pre_cost_time:68.6907980442s
2018-12-13 07:12:33: step11 image_name:942.jpg | rpn_loc_loss:0.492506951094 | rpn_cla_loss:3.72918891907 | rpn_total_loss:4.22169589996 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:3.67863805195e-07 | fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:3.81841331887e-08 | fast_rcnn_total_loss:4.06047945489e-07 | total_loss:5.05774736404 | pre_cost_time:0.319267988205s
2018-12-13 07:12:36: step21 image_name:114527781.jpg | rpn_loc_loss:0.22121527791 | rpn_cla_loss:0.155304968357 | rpn_total_loss:0.376520246267 | fast_rcnn_loc_loss:0.0451134853065 | fast_rcnn_cla_loss:0.325144588947 | fast_rcnn_loc_rotate_loss:0.188544362783 | fast_rcnn_cla_rotate_loss:0.424986839294 | fast_rcnn_total_loss:0.983789265156 | total_loss:2.19642567635 | pre_cost_time:0.266241073608s
2018-12-13 07:12:39: step31 image_name:111044010.jpg | rpn_loc_loss:0.0363116413355 | rpn_cla_loss:0.156413659453 | rpn_total_loss:0.192725300789 | fast_rcnn_loc_loss:0.0495819486678 | fast_rcnn_cla_loss:0.0632207170129 | fast_rcnn_loc_rotate_loss:0.143409430981 | fast_rcnn_cla_rotate_loss:0.073470339179 | fast_rcnn_total_loss:0.329682469368 | total_loss:1.35856246948 | pre_cost_time:0.27028298378s
2018-12-13 07:12:42: step41 image_name:670.jpg | rpn_loc_loss:0.401484191418 | rpn_cla_loss:0.393361717463 | rpn_total_loss:0.794845938683 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.0340446382761 | fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.0378245897591 | fast_rcnn_total_loss:0.0718692243099 | total_loss:1.70288407803 | pre_cost_time:28.3492949009s
2018-12-13 07:13:41: step51 image_name:1534935113519.jpg | rpn_loc_loss:0.0820427164435 | rpn_cla_loss:0.138145014644 | rpn_total_loss:0.220187723637 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00435423571616 | fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.00467131333426 | fast_rcnn_total_loss:0.00902554951608 | total_loss:1.06538057327 | pre_cost_time:0.26727604866s
2018-12-13 07:14:08: step61 image_name:1535085665101.jpg | rpn_loc_loss:0.116681322455 | rpn_cla_loss:0.216219723225 | rpn_total_loss:0.332901060581 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00808826368302 | fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.00720638176426 | fast_rcnn_total_loss:0.0152946449816 | total_loss:1.1843521595 | pre_cost_time:0.269505977631s
2018-12-13 07:14:10.270148: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
2018-12-13 07:14:10.270245: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
2018-12-13 07:14:10.270418: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
2018-12-13 07:14:10.270677: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
Traceback (most recent call last):
  File "train1.py", line 264, in <module>
    train()
  File "train1.py", line 35, in train
    next_batch(dataset_name=cfgs.DATASET_NAME,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]

Caused by op u'get_batch/batch', defined at:
  File "train1.py", line 264, in <module>
    train()
  File "train1.py", line 35, in train
    next_batch(dataset_name=cfgs.DATASET_NAME,
  File "../data/io/read_tfrecord.py", line 87, in next_batch
    dynamic_pad=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 927, in batch
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 722, in _batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 464, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2418, in _queue_dequeue_many_v2
    component_types=component_types, timeout_ms=timeout_ms, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
  [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]

It seems to be the same problem when running python train1.py.

lemonaha commented 5 years ago

I found out that the reason might be some of the XML files. Some images have no gtbox, so we have to skip those samples when converting them to tfrecord. Just add the following after line 97:

    img_height, img_width, gtbox_label = read_xml_gtbox_and_label(xml)
    if gtbox_label.shape[0] <= 0:
        continue

ChaoFan96 commented 5 years ago

The same issue comes up when I bring in my own dataset.

ChaoFan96 commented 5 years ago

> The same issue comes up when I bring in my own dataset.

I solved the problem by ensuring the correctness of the original dataset. I guess any data error (including data_path, data_format, data_shape, etc.) can cause this issue, but that is just a guess.
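
One way to sanity-check the data before training (my own suggestion, not something from this repo) is to iterate over the generated tfrecord and confirm that it is non-empty and that every record parses. A quick sketch using the TF 1.x API:

    # Count and parse all records in a .tfrecord file to catch empty/corrupted files early.
    import tensorflow as tf

    def count_records(tfrecord_path):
        count = 0
        for serialized in tf.python_io.tf_record_iterator(tfrecord_path):
            example = tf.train.Example()
            example.ParseFromString(serialized)  # raises if a record is corrupted
            count += 1
        print('%s contains %d examples' % (tfrecord_path, count))
        return count

If the count is zero, the input queue closes immediately at the first dequeue and produces exactly this OutOfRangeError.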

EricYangsw commented 5 years ago

In my case, the data format was wrong. My .xml files record the bndbox as (Xmin, Xmax, Ymin, Ymax). After I converted it to (x1, y1, x2, y2, x3, y3, x4, y4) in convert_data_to_tfrecord.py (function: read_xml_gtbox_and_label), it can train now.

            # original code:
            if child_item.tag == 'bndbox':
                tmp_box = []
                for node in child_item:
                    tmp_box.append(int(node.text))

            # My Modification:
            if child_item.tag == 'bndbox':
                orig_tmp_box = []
                tmp_box = []
                for node in child_item:
                    orig_tmp_box.append(int(node.text))

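                # Reorder the four bndbox values into the eight corner coordinates expected downstream.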
                for my_idx in [0,1,2,1,2,3,0,3,]:
                    tmp_box.append(orig_tmp_box[my_idx])
                assert label is not None, 'label is none, error'
                tmp_box.append(label)
                box_list.append(tmp_box)
viibridges commented 5 years ago

I think I have found what causes the problem. In my case, I encountered the same error after I removed some training examples by applying some filters in data/io/convert_data_to_tfrecord.py. It looks like you have to close the tfrecord writer handle after you finish the conversion to prevent the problem. Just add writer.close() at the end of data/io/convert_data_to_tfrecord.py and the problem will be gone.
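
A minimal sketch of that point, assuming the usual TF 1.x TFRecordWriter usage (the function and variable names here are illustrative, not the repo's exact code):

    import tensorflow as tf

    def convert_to_tfrecord(serialized_examples, save_path):
        writer = tf.python_io.TFRecordWriter(save_path)
        for serialized in serialized_examples:
            writer.write(serialized)
        # Without close() the last buffered records may never be flushed to disk,
        # leaving a truncated or empty .tfrecord; the input queue then closes
        # immediately and dequeue fails with the OutOfRangeError above.
        writer.close()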

Zappytoes commented 4 years ago

I was able to overcome this error in Google Colab by reducing the amount of data I fed into the tfrecord. My original tfrecord for all my data was around 16 GB. I broke my data up into smaller ~3 GB tfrecords (about 1000 1024x1024 images with annotations each). I then trained a detector using the first tfrecord, and when training ended, I resumed training with the next tfrecord (see the sketch below).
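
A rough sketch of splitting the data into several smaller tfrecord shards, along the lines described above; serialize_example is a hypothetical helper standing in for the repo's per-image conversion code:

    import tensorflow as tf

    def write_shards(image_list, shard_size=1000, prefix='train'):
        for start in range(0, len(image_list), shard_size):
            shard = image_list[start:start + shard_size]
            path = '%s_%04d.tfrecord' % (prefix, start // shard_size)
            writer = tf.python_io.TFRecordWriter(path)
            for image in shard:
                writer.write(serialize_example(image))  # hypothetical helper
            writer.close()  # flush each shard before moving on to the next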

HUI11126 commented 3 years ago

https://github.com/yangxue0827/FPN_Tensorflow/issues/35#issuecomment-414103141 Following the method in the second comment there: first run data/io/convert_data_to_tfrecord.py, which generates a file under data/tfrecord that is much larger than the dataset, and then just run train.py.

Note that the data/tfrecord folder has to be created manually; it does not exist under data in the source code.