yangxue0827 / R2CNN_FPN_Tensorflow

R2CNN: Rotational Region CNN Based on FPN (Tensorflow)

getting OOM with custom dataset #10

Open ravikantb opened 6 years ago

ravikantb commented 6 years ago

Hi,

I have trained this model (ResNet-101) on my custom dataset with a few hundred records, but as soon as I increase that to ~80k records I get the following memory error. I have tried decreasing the RPN / Fast-RCNN batch sizes, but I still hit the same error while allocating memory for one tensor or another. I have 2 GTX 1080 GPUs, but the error appears no matter what configuration I use (single GPU, or multi-GPU with different batch sizes). Any help on how to avoid it will be greatly appreciated. Thanks.

2018-02-14 18:03:35.915512: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2018-02-14 18:03:35.915558: W tensorflow/core/framework/op_kernel.cc:1192] **Resource exhausted: OOM when allocating tensor with shape[1,75,570,256]**
2018-02-14 18:03:35.915634: W tensorflow/core/framework/op_kernel.cc:1192] Internal: Dst tensor is not initialized.
  [[Node: make_anchors/make_anchors_all_level/make_anchors_P6/range/_3411 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16518_make_anchors/make_anchors_all_level/make_anchors_P6/range", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](^make_anchors/make_anchors_all_level/make_anchors_P6/range/_3410)]]
2018-02-14 18:03:35.915726: W tensorflow/core/framework/op_kernel.cc:1192] Internal: Dst tensor is not initialized.
  [[Node: make_anchors/make_anchors_all_level/make_anchors_P6/range/_3411 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16518_make_anchors/make_anchors_all_level/make_anchors_P6/range", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](^make_anchors/make_anchors_all_level/make_anchors_P6/range/_3410)]]
2018-02-14 18:03:35.915792: W tensorflow/core/framework/op_kernel.cc:1192] Internal: Dst tensor is not initialized.
  [[Node: make_anchors/make_anchors_all_level/make_anchors_P6/range/_3411 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16518_make_anchors/make_anchors_all_level/make_anchors_P6/range", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](^make_anchors/make_anchors_all_level/make_anchors_P6/range/_3410)]]
2018-02-14 18:03:35.915843: W tensorflow/core/framework/op_kernel.cc:1192] Internal: Dst tensor is not initialized.
  [[Node: make_anchors/make_anchors_all_level/make_anchors_P6/range/_3411 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16518_make_anchors/make_anchors_all_level/make_anchors_P6/range", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](^make_anchors/make_anchors_all_level/make_anchors_P6/range/_3410)]]
2018-02-14 18:03:35.919749: W tensorflow/core/kernels/queue_base.cc:295] _0_get_batch/input_producer: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920027: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920066: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920089: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920110: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920132: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920174: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920195: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920214: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920235: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920281: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920306: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920322: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920343: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920362: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920382: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2018-02-14 18:03:35.920404: W tensorflow/core/kernels/queue_base.cc:295] _2_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "/home/neo/ML/R2CNN_FPN_Tensorflow/tools/train1.py", line 263, in <module>
    train()
  File "/home/neo/ML/R2CNN_FPN_Tensorflow/tools/train1.py", line 225, in train
    fast_rcnn_total_loss, total_loss, train_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
  [[Node: make_anchors/make_anchors_all_level/make_anchors_P6/range/_3411 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16518_make_anchors/make_anchors_all_level/make_anchors_P6/range", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](^make_anchors/make_anchors_all_level/make_anchors_P6/range/_3410)]]
  [[Node: gradients/rpn_net/concat_1_grad/Shape_2/_3305 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_15992_gradients/rpn_net/concat_1_grad/Shape_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
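For context: in TensorFlow 1.x, the "Dst tensor is not initialized" errors that follow the OOM warning are typically another symptom of GPU memory exhaustion. One general mitigation, independent of how this repository's train script actually constructs its session, is to let TensorFlow allocate GPU memory on demand and to restrict training to a single visible GPU. A minimal sketch, assuming a TF 1.x tf.Session-based training loop:

import os
import tensorflow as tf

# Expose only one GPU to TensorFlow (the "0" here is an illustrative choice).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Grow the GPU allocation on demand instead of reserving all memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory the process may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training ops here ...

Note that this does not reduce the peak memory the model itself needs, so it usually has to be combined with a smaller input size or fewer anchors, as discussed below.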

FantasticEthan commented 6 years ago

@ravikantb I have met the same problem. If you have solved it, can you share the solution?

yangxue0827 commented 6 years ago

Please change the value of SHORT_SIDE_LEN in cfgs.py.
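For reference, a minimal sketch of that change, assuming the config lives in libs/configs/cfgs.py; the path and the numbers shown here are illustrative, not the repository's actual defaults:

# libs/configs/cfgs.py (illustrative excerpt; surrounding values are assumptions)
# SHORT_SIDE_LEN is the length the shorter image side is resized to before training.
# A smaller value shrinks every FPN feature map and the anchor grid, which directly
# reduces GPU memory use, at the cost of losing fine detail (e.g. small text).
SHORT_SIDE_LEN = 600  # lower this if training runs out of GPU memory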

ravikantb commented 6 years ago

@Bboysummer : I have not solved the problem yet, but as the author suggested, I am trying to run with a smaller SHORT_SIDE_LEN. I am getting some other errors right now, which could be because I changed some other configs, since I am working with a text dataset. I will update here if I succeed.

@yangxue0827 : Thanks for your suggestion. Could you also suggest some other ways to reduce the GPU memory footprint? I am working with images that contain text and would like to keep SHORT_SIDE_LEN high, because otherwise resizing loses detail and the text becomes blurry. I have successfully trained py-Faster-RCNN with a larger shortest side and would like to test R2CNN the same way. I am also experimenting with the anchors, since I do not need all of the provided anchors; I hope that helps a bit.
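As a sketch of that anchor-trimming idea: fewer anchor ratios and scales mean fewer anchors per feature-map location, which shrinks the RPN outputs and the anchor-target matching buffers. The parameter names below are illustrative assumptions; check cfgs.py for the names this repository actually uses.

# Illustrative cfgs.py excerpt -- these names and values are assumptions,
# not necessarily what R2CNN_FPN_Tensorflow ships with.
ANCHOR_RATIOS = [1 / 2., 1., 2.]  # drop extreme ratios that never match your boxes
ANCHOR_SCALES = [1.]              # a single scale per FPN level instead of several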

FantasticEthan commented 6 years ago

@ravikantb Are you using your own dataset? I changed the value of SHORT_SIDE_LEN and cropped each of my pictures, but the problem still happens. I think there may be some mistakes in the dataset labels. You can check your dataset too. I will update if I succeed.
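One quick way to act on that suggestion is to scan the annotations for obviously invalid boxes before building the tfrecords. A minimal sketch, assuming Pascal VOC-style axis-aligned XML labels; for rotated-box labels the corresponding corner fields would need to be checked instead:

import glob
import xml.etree.ElementTree as ET

# Flag boxes that fall outside the image or have non-positive width/height.
for xml_path in glob.glob("Annotations/*.xml"):
    root = ET.parse(xml_path).getroot()
    img_w = int(root.find("size/width").text)
    img_h = int(root.find("size/height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        if xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h or xmax <= xmin or ymax <= ymin:
            print("suspect box in %s: %s" % (xml_path, (xmin, ymin, xmax, ymax)))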

yangxue0827 commented 6 years ago

There was no problem when I set SHORT_SIDE_LEN=720 on the ICDAR2015 dataset.