yizt / keras-faster-rcnn

Keras implementation of Faster R-CNN with end-to-end training and inference; under continuous development, see the todo... Feel free to try it out, follow the project, and report issues.

Running train.py keeps hitting OOM #32

Open BngThea opened 4 years ago

BngThea commented 4 years ago

After the GPU is initialized:

2020-02-22 10:32:07.920229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10023 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5)

I get the warning: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

Then training starts and GPU memory is immediately exhausted:

2020-02-22 10:56:41.258201: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at tile_ops.cc:220 : Resource exhausted: OOM when allocating tensor with shape[512,7,7,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I searched online for this issue; the warning points to tf.gather, but the solutions I found are all tied to specific code. Do you know what is going on here? Thanks!

yizt commented 4 years ago

@BngThea Reduce these two parameters: IMAGES_PER_GPU and IMAGE_MAX_DIM.
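
As an illustration, this is roughly what that change might look like (a minimal sketch: the attribute names IMAGES_PER_GPU and IMAGE_MAX_DIM come from this thread, but the surrounding config class is assumed rather than copied from the repo):

```python
# Hypothetical config sketch -- only IMAGES_PER_GPU and IMAGE_MAX_DIM are
# taken from the discussion; the class itself is illustrative.
class TrainConfig:
    IMAGES_PER_GPU = 2   # images per batch per GPU (original value in this thread)
    IMAGE_MAX_DIM = 720  # longest image side after resizing (original value)

config = TrainConfig()
# Lower both values to shrink the per-step GPU memory footprint:
config.IMAGES_PER_GPU = 1
config.IMAGE_MAX_DIM = 500
```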

BngThea commented 4 years ago

@yizt Thanks. I set IMAGES_PER_GPU from 2 to 1 and IMAGE_MAX_DIM from 720 to 500, and it runs now.

BngThea commented 4 years ago

@yizt Hello, I just made the changes above, but the loss explodes during training. I restarted several times and it happens every time:

40/1252 [..............................] - ETA: 23:06 - loss: 245879418.9902 - rpn_bbox_loss: 0.6706 - rpn_class_loss: 0.5414 - rcnn_bbox_loss: 0.8370 - rcnn_class_loss: 1.3189 - regular_loss: 52.1087 - gt_num: 2.9813 - positive_anchor_num: 12.7000 - negative_anchor_num: 67.3000 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6063 - roi_num: 1969.6062 - positive_roi_num: 20.3312 - negativ
41/1252 [..............................] - ETA: 22:45 - loss: 249477870.6246 - rpn_bbox_loss: 0.6715 - rpn_class_loss: 0.5355 - rcnn_bbox_loss: 0.8344 - rcnn_class_loss: 1.3020 - regular_loss: 52.8713 - gt_num: 2.9390 - positive_anchor_num: 12.4817 - negative_anchor_num: 67.5183 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6082 - roi_num: 1970.3475 - positive_roi_num: 20.1280 - negativ
42/1252 [>.............................] - ETA: 22:25 - loss: 252904967.4192 - rpn_bbox_loss: 0.6670 - rpn_class_loss: 0.5286 - rcnn_bbox_loss: 0.8322 - rcnn_class_loss: 1.2902 - regular_loss: 53.5976 - gt_num: 2.9226 - positive_anchor_num: 12.5238 - negative_anchor_num: 67.4762 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6104 - roi_num: 1971.0536 - positive_roi_num: 20.2143 - negativ
43/1252 [>.............................] - ETA: 22:06 - loss: 256172664.3630 - rpn_bbox_loss: 0.6636 - rpn_class_loss: 0.5222 - rcnn_bbox_loss: 0.8288 - rcnn_class_loss: 1.2761 - regular_loss: 54.2901 - gt_num: 2.8953 - positive_anchor_num: 12.3605 - negative_anchor_num: 67.6395 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6125 - roi_num: 1971.1279 - positive_roi_num: 20.1919 - negativ
44/1252 [>.............................] - ETA: 21:47 - loss: 259291829.6274 - rpn_bbox_loss: 0.6674 - rpn_class_loss: 0.5176 - rcnn_bbox_loss: 0.8258 - rcnn_class_loss: 1.2660 - regular_loss: 54.9511 - gt_num: 2.9375 - positive_anchor_num: 12.2898 - negative_anchor_num: 67.7102 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6101 - roi_num: 1970.9773 - positive_roi_num: 20.2557 - negativ
45/1252 [>.............................] - ETA: 21:30 - loss: 262272365.3246 - rpn_bbox_loss: 0.6655 - rpn_class_loss: 0.5145 - rcnn_bbox_loss: 0.8242 - rcnn_class_loss: 1.2521 - regular_loss: 55.5828 - gt_num: 2.9667 - positive_anchor_num: 12.3722 - negative_anchor_num: 67.6278 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6090 - roi_num: 1971.1111 - positive_roi_num: 20.2444 - negativ

yizt commented 4 years ago

I'll test it this afternoon and get back to you.

yizt commented 4 years ago

@BngThea Please update the code and try again.

BngThea commented 4 years ago

@yizt Hello, after updating I tested 5 times. In two runs the loss grew somewhat more slowly, but it still increased in the end; the other 3 runs showed no improvement, or even blew up faster.

yizt commented 4 years ago

@BngThea I also set IMAGES_PER_GPU to 1 and IMAGE_MAX_DIM to 500, and the loss does not explode:

2020-02-23 08:46:09.669647: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-23 08:46:10.244852: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
1872/22136 [=>............................] - ETA: 1:19:18 - loss: 4.4972 - rpn_bbox_loss: 0.7863 - rpn_class_loss: 0.2431 - rcnn_bbox_loss: 0.6923 - rcnn_class_loss: 0.6140 - regular_loss: 13.5618 - gt_num: 2.5067 - positive_anchor_num: 6.9930 - negative_anchor_num: 73.0070 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.5722 - roi_num: 1426.8147 - positive_roi_num: 15.3840 - negative_roi_num: 10

Also, IMAGES_PER_GPU=3 with IMAGE_MAX_DIM=720 runs fine on an RTX 2080 Ti; I'm using an RTX 2080 Ti as well.

BngThea commented 4 years ago

@yizt Are you still on TF 1.9? I'm on 1.14 now, because my CUDA version is 10.1. My other demos all run on 2.x or 1.14 and I'd rather not change the CUDA version. Could that make a difference? My Keras version is 2.2.5.

yizt commented 4 years ago

@BngThea My TF version is also 1.14, and CUDA is V10.0.130. The project now uses the Keras bundled with TF.
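
For anyone comparing environments, a quick generic check (not code from this repo; TF 1.x-style APIs are assumed) to confirm which TensorFlow/Keras/CUDA build is actually loaded:

```python
import tensorflow as tf
import keras  # standalone Keras, 2.2.5 in the discussion above

print("TensorFlow:", tf.__version__)
print("tf.keras:", tf.keras.__version__)             # Keras bundled with TF
print("standalone Keras:", keras.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())  # TF 1.x-style check
```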

BngThea commented 4 years ago

@yizt Strange. On Ubuntu 18.04 with the same hardware configuration the loss explodes, yet it runs fine on Win10.

A few more questions:

1. I trained with ResNet for 80 epochs and the loss settles around 0.3, yet the mAP is very low. Roughly what loss value do you reach?
2. I have my own dataset, already organized in VOC2007 format. Each image has exactly one GT box, and the image size is fixed at 378*427. How should I adjust the config for training? I handled the fixed size by modifying the corresponding function, but the results with your model's default settings are far worse than the TensorFlow Faster R-CNN (https://github.com/smallcorgi/Faster-RCNN_TF). How should the other config parameters be tuned, and how should the number of clusters be set in the GT function that generates anchors? Thanks!
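
On the cluster-count question, one common, repo-independent approach is to run k-means over the ground-truth box widths and heights and use the cluster centers as anchor sizes; a minimal sketch follows (parsing the VOC annotations into an (N, 2) array is assumed to be done separately):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_sizes(gt_wh, n_clusters=3):
    """Suggest anchor (width, height) pairs by clustering GT boxes.

    gt_wh: array of shape (N, 2), one (width, height) row per GT box in pixels.
    With one GT per image and a fixed 378x427 image size, a small number of
    clusters (e.g. 3-5) is usually enough.
    """
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(gt_wh)
    return km.cluster_centers_  # each center is a representative anchor size

# Dummy example; replace with widths/heights parsed from the VOC XML files.
dummy_wh = np.array([[60, 80], [64, 90], [120, 150], [110, 160], [200, 220]])
print(cluster_anchor_sizes(dummy_wh, n_clusters=3))
```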

wanghangege commented 3 years ago

My training stops halfway through; is it running evaluation at that point?