Open shaileshvedula opened 6 years ago
I ran into the same problem. I think it's because the training process does not repeat the data: once the training step exceeds the number of examples in your training set, this error occurs. Rewriting the data-input code so it loops over the dataset should fix it.
In fact, the code itself is fine; the error is probably caused by a problem in the environment configuration. I have modified the code, please update.
It still produces the same error. Can you tell me what change you made?
@1991viet The author yangxue0827 has already modified the code, so the problem should be solved. You can look at the details of the code changes from around Jan 30.
I have the same problem. Has anyone solved it? Thanks.
I am getting the same error. I have checked the tfrecord path; it seems to be correct, and the tfrecord creation didn't report any problem. Is there a solution?
Has anyone solved this problem? I am running into it as well.
I am getting the same error too...
The same problem...
2018-12-13 07:07:10.108873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:05:00.0, compute capability: 6.1)
restore model
2018-12-13 07:10:08.417504: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Input to reshape is a tensor with 1 values, but the requested shape requires a multiple of 9
[[Node: get_batch/Reshape_1 = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](get_batch/DecodeRaw_1/_1061, get_batch/Reshape_1/shape)]]
2018-12-13 07:10:58: step1 image_name:1534151804311.jpg |
rpn_loc_loss:0.244331941009 | rpn_cla_loss:1.68899667263 |
rpn_total_loss:1.93332862854 |
fast_rcnn_loc_loss:0.0440079607069 | fast_rcnn_cla_loss:1.43304491043 |
fast_rcnn_loc_rotate_loss:0.203424081206 | fast_rcnn_cla_rotate_loss:1.65664339066 |
fast_rcnn_total_loss:3.33712053299 |
total_loss:6.10644292831 | pre_cost_time:68.6907980442s
2018-12-13 07:12:33: step11 image_name:942.jpg |
rpn_loc_loss:0.492506951094 | rpn_cla_loss:3.72918891907 |
rpn_total_loss:4.22169589996 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:3.67863805195e-07 |
fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:3.81841331887e-08 |
fast_rcnn_total_loss:4.06047945489e-07 |
total_loss:5.05774736404 | pre_cost_time:0.319267988205s
2018-12-13 07:12:36: step21 image_name:114527781.jpg |
rpn_loc_loss:0.22121527791 | rpn_cla_loss:0.155304968357 |
rpn_total_loss:0.376520246267 |
fast_rcnn_loc_loss:0.0451134853065 | fast_rcnn_cla_loss:0.325144588947 |
fast_rcnn_loc_rotate_loss:0.188544362783 | fast_rcnn_cla_rotate_loss:0.424986839294 |
fast_rcnn_total_loss:0.983789265156 |
total_loss:2.19642567635 | pre_cost_time:0.266241073608s
2018-12-13 07:12:39: step31 image_name:111044010.jpg |
rpn_loc_loss:0.0363116413355 | rpn_cla_loss:0.156413659453 |
rpn_total_loss:0.192725300789 |
fast_rcnn_loc_loss:0.0495819486678 | fast_rcnn_cla_loss:0.0632207170129 |
fast_rcnn_loc_rotate_loss:0.143409430981 | fast_rcnn_cla_rotate_loss:0.073470339179 |
fast_rcnn_total_loss:0.329682469368 |
total_loss:1.35856246948 | pre_cost_time:0.27028298378s
2018-12-13 07:12:42: step41 image_name:670.jpg |
rpn_loc_loss:0.401484191418 | rpn_cla_loss:0.393361717463 |
rpn_total_loss:0.794845938683 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.0340446382761 |
fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.0378245897591 |
fast_rcnn_total_loss:0.0718692243099 |
total_loss:1.70288407803 | pre_cost_time:28.3492949009s
2018-12-13 07:13:41: step51 image_name:1534935113519.jpg |
rpn_loc_loss:0.0820427164435 | rpn_cla_loss:0.138145014644 |
rpn_total_loss:0.220187723637 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00435423571616 |
fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.00467131333426 |
fast_rcnn_total_loss:0.00902554951608 |
total_loss:1.06538057327 | pre_cost_time:0.26727604866s
2018-12-13 07:14:08: step61 image_name:1535085665101.jpg |
rpn_loc_loss:0.116681322455 | rpn_cla_loss:0.216219723225 |
rpn_total_loss:0.332901060581 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00808826368302 |
fast_rcnn_loc_rotate_loss:0.0 | fast_rcnn_cla_rotate_loss:0.00720638176426 |
fast_rcnn_total_loss:0.0152946449816 |
total_loss:1.1843521595 | pre_cost_time:0.269505977631s
2018-12-13 07:14:10.270148: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
Traceback (most recent call last):
File "train1.py", line 264, in <module>
Caused by op u'get_batch/batch', defined at:
File "train1.py", line 264, in <module>
OutOfRangeError (see above for traceback): PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0) [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
It seems to be the same problem when running python train1.py.
I found that the cause may be some of the XML files: some images have no gt boxes, and we have to skip those examples when converting the data to tfrecord. Just add the following after line 97:
img_height, img_width, gtbox_label = read_xml_gtbox_and_label(xml)
if gtbox_label.shape[0] <= 0:
    continue
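The guard above can be checked in isolation. Using nested lists as a stand-in for the gtbox_label array (its shape[0] is just the number of boxes), samples with zero gt boxes are filtered out before they ever reach the tfrecord writer; the sample data below is made up for illustration:

```python
# Hypothetical samples: file name -> list of gt boxes (each box is
# [x1, y1, x2, y2, label]); an empty list means no annotation.
samples = {
    "a.jpg": [[10, 20, 30, 40, 1]],    # one gt box  -> keep
    "b.jpg": [],                       # no gt boxes -> skip
    "c.jpg": [[5, 5, 9, 9, 2],
              [1, 2, 3, 4, 1]],        # two gt boxes -> keep
}

# Mirrors `if gtbox_label.shape[0] <= 0: continue` in the conversion loop.
kept = [name for name, gtbox_label in sorted(samples.items())
        if len(gtbox_label) > 0]
print(kept)  # -> ['a.jpg', 'c.jpg']
```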
The same issue comes up when I use my own dataset.
I solved the problem by verifying the correctness of the original dataset. My guess is that any data error (data path, data format, data shape, etc.) can cause this issue.
In my case, the data format was wrong. My .xml files record the bndbox as (xmin, xmax, ymin, ymax). After I converted it to (x1, y1, x2, y2, x3, y3, x4, y4) in convert_data_to_tfrecord.py (function read_xml_gtbox_and_label), it can train now.
# original code:
if child_item.tag == 'bndbox':
    tmp_box = []
    for node in child_item:
        tmp_box.append(int(node.text))

# my modification:
if child_item.tag == 'bndbox':
    orig_tmp_box = []
    tmp_box = []
    for node in child_item:
        orig_tmp_box.append(int(node.text))
    # expand the two opposite corners into the four corner points
    for my_idx in [0, 1, 2, 1, 2, 3, 0, 3]:
        tmp_box.append(orig_tmp_box[my_idx])
    assert label is not None, 'label is none, error'
    tmp_box.append(label)
    box_list.append(tmp_box)
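The index trick above can be exercised in isolation. Note that the index list [0, 1, 2, 1, 2, 3, 0, 3] produces sensible corners when the four bndbox child nodes arrive in the usual PASCAL VOC order (xmin, ymin, xmax, ymax); the function name below is made up for illustration:

```python
def bndbox_to_corners(bndbox):
    """Expand [xmin, ymin, xmax, ymax] into the 8-value corner list
    [x1, y1, x2, y2, x3, y3, x4, y4] used by the rotated-box code."""
    return [bndbox[i] for i in [0, 1, 2, 1, 2, 3, 0, 3]]

# An axis-aligned box (10, 20)-(30, 40) becomes its four corner points,
# listed clockwise from the top-left (image y axis points down):
print(bndbox_to_corners([10, 20, 30, 40]))
# -> [10, 20, 30, 20, 30, 40, 10, 40]
```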
I think I found what causes the problem. In my case, I hit the same error after I removed some training examples by applying filters in data/io/convert_data_to_tfrecord.py. It looks like you have to close the tfrecord writer handle after the conversion finishes. Just add writer.close() at the end of data/io/convert_data_to_tfrecord.py and the problem goes away.
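This failure mode is easy to reproduce without TensorFlow: records written through a buffered handle may sit in memory until the handle is closed, so a reader sees a truncated or empty file and the input queue starves. A minimal sketch, using a plain buffered file as a stand-in for tf.python_io.TFRecordWriter:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")

f = open(path, "wb")      # buffered, like the tfrecord writer handle
f.write(b"record-1")      # small write: stays in the userspace buffer

# A reader opening the file now sees nothing yet.
with open(path, "rb") as r:
    print(len(r.read()))  # -> 0

f.close()                 # flushes the buffer; always do this at the end

with open(path, "rb") as r:
    print(r.read())       # -> b'record-1'
```

The same applies to the tfrecord writer: put writer.close() in a try/finally around the conversion loop so it runs even when a filter raises.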
I was able to overcome this error in Google Colab by reducing the amount of data I fed into each tfrecord. My original tfrecord for all my data was around 16 GB. I broke the data up into smaller ~3 GB tfrecords (about 1000 1024x1024 images with annotations each), trained a detector on the first tfrecord, and when training ended, resumed training with the next tfrecord.
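Splitting a large dataset into several smaller tfrecords amounts to sharding the example list before conversion. A minimal, TensorFlow-free sketch of that sharding step (the shard size and file-naming scheme are illustrative, not from the repo):

```python
def shard(examples, shard_size):
    """Split a list of examples into consecutive shards of at most
    shard_size items; each shard would become one tfrecord file."""
    return [examples[i:i + shard_size]
            for i in range(0, len(examples), shard_size)]

images = ["img_%04d.jpg" % i for i in range(10)]
for n, chunk in enumerate(shard(images, 4)):
    # e.g. write chunk into train_%02d.tfrecord, then train on each in turn
    print("train_%02d.tfrecord: %d images" % (n, len(chunk)))
# -> shards of 4, 4 and 2 images
```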
https://github.com/yangxue0827/FPN_Tensorflow/issues/35#issuecomment-414103141 Following the method in the second comment: first run data/io/convert_data_to_tfrecord.py, which generates a file under data/tfrecord that is much larger than the dataset, then run train.py.
Note that the data/tfrecord folder has to be created manually; it does not exist under data in the original source.
Hi,
I get the following error when trying to train the model using train1.py on my custom dataset. I am using ResNet-101 as the backbone. Can you please help me out here?
Traceback (most recent call last):
File "train1.py", line 262, in <module>
train()
File "train1.py", line 224, in train
fast_rcnn_total_loss, total_loss, train_op])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]
Caused by op u'get_batch/batch', defined at:
File "train1.py", line 262, in <module>
train()
File "train1.py", line 36, in train
is_training=True)
File "../data/io/read_tfrecord.py", line 86, in next_batch
dynamic_pad=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 922, in batch
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 716, in _batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 457, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1342, in _queue_dequeue_many_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): PaddingFIFOQueue '_1_get_batch/batch/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0) [[Node: get_batch/batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](get_batch/batch/padding_fifo_queue, get_batch/batch/n)]]