yangxue0827 / R-DFPN_FPN_Tensorflow

R-DFPN: Rotation Dense Feature Pyramid Networks (Tensorflow)
http://www.mdpi.com/2072-4292/10/1/132

out of memory #6

Open DL-ljw opened 6 years ago

DL-ljw commented 6 years ago

Sorry to bother you again. When I train the model on one 1080 GPU with a batch size of 1, I get the following errors. How can I solve this?

2018-05-10 13:42:49: step247692 image_name:000624.jpg | rpn_loc_loss:0.189756244421 | rpn_cla_loss:0.214562356472 | rpn_total_loss:0.404318600893 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00815858319402 | fast_rcnn_total_loss:0.00815858319402 | total_loss:1.17546725273 | per_cost_time:0.65540599823s
out of memory
invalid argument
2018-05-10 13:42:53.349625: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2018-05-10 13:42:53.349637: I tensorflow/stream_executor/stream.cc:4138] stream 0x55cd063dc880 did not memcpy host-to-device; source: 0x7fa30b0da010
2018-05-10 13:42:53.349641: E tensorflow/stream_executor/stream.cc:289] Error recording event in stream: error recording CUDA event on stream 0x55cd063dc950: CUDA_ERROR_ILLEGAL_ADDRESS; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-05-10 13:42:53.349647: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-05-10 13:42:53.349650: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
an illegal memory access was encountered
an illegal memory access was encountered

powermano commented 6 years ago

Same problem. After 5000 steps, the problem occurs.

powermano commented 6 years ago

What is your cuDNN version?

DL-ljw commented 6 years ago

CUDA 8.0, cuDNN 5.0

liqi-lizezhong commented 5 years ago

I have run into the same problem. Have you solved it, and if so, how? Thanks.

powermano commented 5 years ago

I found that reducing the number of anchors can alleviate this problem to some extent. You can remove some of the angles or ratios in R-DFPN_FPN_Tensorflow/libs/configs/cfgs.py:

ANCHOR_ANGLES = [-90, -75, -60, -45, -30, -15]
ANCHOR_RATIOS = [1/5., 5., 1/7., 7., 1/9, 9]

I encountered the CUDA_ERROR_ILLEGAL_ADDRESS error during training when the objects in an image are densely located, so limiting the objects in your own dataset (that is, dropping some objects that really exist) can also alleviate the problem. It works, but not all the time.
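To get a rough feel for how much a trimmed configuration shrinks the anchor set, here is a minimal sketch. It assumes one anchor is generated per (scale, ratio, angle) combination at each feature-map location, which is the usual scheme but is not confirmed in this thread. The `REDUCED_*` lists and the `anchors_per_location` helper are hypothetical names used only for illustration; they are not part of the repo.

```python
# Values quoted above from libs/configs/cfgs.py (copied as posted).
ANCHOR_ANGLES = [-90, -75, -60, -45, -30, -15]
ANCHOR_RATIOS = [1/5., 5., 1/7., 7., 1/9, 9]

# Hypothetical trimmed settings: every other angle, and the most extreme ratios dropped.
REDUCED_ANGLES = [-90, -60, -30]
REDUCED_RATIOS = [1/5., 5., 1/7., 7.]


def anchors_per_location(angles, ratios, num_scales=1):
    """Count anchors per feature-map location.

    Assumption: one anchor per (scale, ratio, angle) combination.
    """
    return num_scales * len(ratios) * len(angles)


full = anchors_per_location(ANCHOR_ANGLES, ANCHOR_RATIOS)
reduced = anchors_per_location(REDUCED_ANGLES, REDUCED_RATIOS)
print("anchors per location: full=%d, reduced=%d" % (full, reduced))
```

Under that assumption, the trimmed lists cut the anchors per location to a third, which should lower GPU memory use in the anchor-matching steps. Which angles and ratios can be dropped safely depends on the object shapes in your dataset.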


clw5180 commented 5 years ago

Thanks a lot! It works.