yangxue0827 / R-DFPN_FPN_Tensorflow

R-DFPN: Rotation Dense Feature Pyramid Networks (Tensorflow)
http://www.mdpi.com/2072-4292/10/1/132

Out of Memory #24

Open hbkooo opened 5 years ago

hbkooo commented 5 years ago

While training on my data, the errors below sometimes occur at a random point (around epoch 291, or elsewhere). Watching the GPU memory, it stays steady at first, then suddenly increases and the following errors appear. How can I solve this problem? Thank you.

Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.27GiB. Current allocation summary follows.
2019-07-28 14:29:37.479031: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 4, Chunks in use: 0 1.0KiB allocated for chunks. 19B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-07-28 14:29:37.479056: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 3, Chunks in use: 0 1.5KiB allocated for chunks. 210B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-07-28 14:29:37.479072: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 1, Chunks in use: 0 1.0KiB allocated for chunks. 80B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
...
2019-07-28 14:29:37.541280: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 21607936 totalling 41.21MiB
2019-07-28 14:29:37.541286: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 9 Chunks of size 23040000 totalling 197.75MiB
2019-07-28 14:29:37.541293: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 29791744 totalling 28.41MiB
2019-07-28 14:29:37.541302: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 30857472 totalling 29.43MiB
2019-07-28 14:29:37.541312: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 44332032 totalling 42.28MiB
2019-07-28 14:29:37.541321: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 51380224 totalling 147.00MiB
2019-07-28 14:29:37.541330: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 1.99GiB
2019-07-28 14:29:37.541345: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:        10990990132
InUse:         2133677824
MaxInUse:      6298388480
NumAllocs:        1694962
MaxAllocSize:  4294967296

2019-07-28 14:29:37.541752: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *____**
2019-07-28 14:29:38.079996: W tensorflow/core/kernels/queue_base.cc:294] _0_get_batch/input_producer: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080480: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080751: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080771: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080789: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080805: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080858: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080873: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080886: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080937: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.080997: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081011: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081027: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081041: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081054: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081067: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2019-07-28 14:29:38.081078: W tensorflow/core/kernels/queue_base.cc:294] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "train.py", line 299, in <module>
    train()
  File "train.py", line 260, in train
    fast_rcnn_total_loss, total_loss, train_op])
  File "/home/hbk/miniconda3/envs/mytensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/home/hbk/miniconda3/envs/mytensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/home/hbk/miniconda3/envs/mytensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/home/hbk/miniconda3/envs/mytensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
  [[Node: rpn_losses/rpn_minibatch/rpn_find_positive_negative_samples/PyFunc/_3605 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_20948_rpn_losses/rpn_minibatch/rpn_find_positive_negative_samples/PyFunc", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
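(Note: to pinpoint where the memory spike happens, it can help to log GPU memory usage once per second while training runs. The sketch below is not part of this repository; it assumes an NVIDIA GPU with the nvidia-smi tool on PATH.)

```python
# Minimal sketch (not from this repo): print GPU memory usage once per second
# so the sudden increase described above can be matched to a training step.
import subprocess
import time

def log_gpu_memory(interval_sec=1.0):
    while True:
        # memory.used / memory.total are standard nvidia-smi query fields
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader"])
        print(time.strftime("%H:%M:%S"), out.decode().strip())
        time.sleep(interval_sec)

if __name__ == "__main__":
    log_gpu_memory()
```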

Avant-Gardiste commented 2 years ago

Try reducing the batch size!
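Besides a smaller batch (or smaller input images), capping how much GPU memory TensorFlow 1.x grabs can also help. Below is a minimal sketch, assuming train.py builds its own tf.Session (where exactly to pass the config depends on the repository's training script); allow_growth and per_process_gpu_memory_fraction are standard tf.ConfigProto options, not repo-specific settings.

```python
# Minimal sketch (TensorFlow 1.x), assuming the training script creates a tf.Session.
# allow_growth lets the BFC allocator claim GPU memory incrementally instead of all
# at once, and per_process_gpu_memory_fraction puts a hard cap on how much it may take.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # grow allocations on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # optional hard cap (90% of GPU)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the existing training loop here ...
```

If the error still appears, the spike may be caused by a single oversized image or proposal batch rather than a gradual leak, so limiting the maximum input image size during preprocessing is another thing to try.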