
Light-Head R-CNN

Has anyone run this successfully on a single GPU (GTX 1060)? I tried it and ran out of memory #61

Open ShenDC opened 5 years ago

ShenDC commented 5 years ago

I am trying to train on my own dataset with a single GPU (GTX 1060, 6 GB), and it always runs out of memory at the third epoch. Any suggestions on how to fix this would be greatly appreciated.

2018-06-21 08:53:49.853249: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit:        5856854016
InUse:        5832717824
MaxInUse:     5845060608
NumAllocs:    2163
MaxAllocSize: 1121255424

2018-06-21 08:53:49.853344: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****
2018-06-21 08:53:49.853378: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[2,50,50,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,100,100,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[Node: tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv2/Relu, resnet_v1_101/block2/unit_4/bottleneck_v1/conv3/weights/read/_1533)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: tower_4/gradients/tower_4/resnet_v1_101_3/block3/unit_11/bottleneck_v1/conv2/Conv2D_grad/tuple/control_dependency_1/_6953 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_33582_tower_4/gradients/tower_4/resnet_v1_101_3/block3/unit_11/bottleneck_v1/conv2/Conv2D_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
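(For reference, the hint above is about the TensorFlow 1.x RunOptions protobuf. A minimal sketch of enabling it, where `sess`, `fetches`, and `feed_dict` stand in for whatever train.py actually runs and are not this repo's names:)

```python
import tensorflow as tf

# Sketch only (TF 1.x): report per-tensor allocations when an OOM occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# 'sess', 'fetches', and 'feed_dict' are placeholders for the training session,
# fetched ops, and inputs used in train.py.
sess_ret = sess.run(fetches, feed_dict=feed_dict, options=run_options)
```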

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 265, in <module>
    train(args)
  File "train.py", line 213, in train
    sess_ret = sess.run(sess2run, feed_dict=feed_dict)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,100,100,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[Node: tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv2/Relu, resnet_v1_101/block2/unit_4/bottleneck_v1/conv3/weights/read/_1533)]]
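(A common general mitigation for this kind of ResourceExhaustedError in TensorFlow 1.x, independent of this repo's code, is to let the session grow GPU memory on demand instead of reserving it all up front. A sketch, assuming a standard tf.Session setup; it will not help if the model genuinely needs more than the 6 GB available:)

```python
import tensorflow as tf

# Sketch only (TF 1.x): allocate GPU memory incrementally rather than all at once.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # The actual training ops from train.py would be run here; this only
    # illustrates how the config is passed to the session.
    sess.run(tf.global_variables_initializer())
```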

karansomaiah commented 5 years ago

How are you getting tower_4 when you're only running with one GPU? Can you post your train.py command? Or are you making changes to your training file?
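(For reference, the tower_4 / tower_5 names in the log suggest the graph is still being built with multiple per-GPU towers. One generic way to make sure TensorFlow sees only a single physical GPU is the standard CUDA_VISIBLE_DEVICES environment variable; this is not a flag from this repo's train.py, just a sketch:)

```python
import os

# Sketch only: expose a single GPU to TensorFlow. This must run before
# TensorFlow initializes CUDA, e.g. at the very top of train.py.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

The same effect can be had from the shell by prefixing the training command with the variable when launching train.py.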