Loss NaN about using vovnet as backbone in RetinaNet

y200504040u commented 4 years ago

Hi! Thank you for your great work. I wanted to improve the RetinaNet project in detectron2/projects by replacing "retinanet_resnet_fpn_backbone" with "retinanet_vovnet_fpn_backbone". However, I always encounter "loss NaN" within the first 1000 iterations of training. Training with "retinanet_resnet_fpn_backbone" works fine.

I want to make sure that I wasn't doing something wrong.

My config YAML:

_BASE_: "../Base-RetinaNet.yaml"
MODEL:
  WEIGHTS: "./pre_train/vovnet39_ese_detectron2.pth"
  RETINANET:
    NUM_CLASSES: 2
  BACKBONE:
    NAME: "build_retinanet_vovnet_fpn_backbone"
    FREEZE_AT: 0
  VOVNET:
    CONV_BODY : "V-39-eSE"
    OUT_FEATURES: ["stage3", "stage4", "stage5"]
  FPN:
    IN_FEATURES: ["stage3", "stage4", "stage5"]
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 270000
OUTPUT_DIR: "output/retina/V_39_ms_3x"

build_retinanet_vovnet_fpn_backbone

@BACKBONE_REGISTRY.register()
def build_retinanet_vovnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """

    bottom_up = build_vovnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    in_channels_top = out_channels
    top_block = LastLevelP6P7(in_channels_top, out_channels, "p5")
    # in_channels_p6p7 = bottom_up.output_shape()["res5"].channels
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=top_block,
        # top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone
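
One thing worth sanity-checking with a builder like this is which feature levels the FPN ends up exposing, since the RetinaNet head has to be configured with matching names. The sketch below mimics the usual p{log2(stride)} naming in plain Python. The helper name fpn_output_strides is hypothetical, and the strides 8, 16, 32 for stage3, stage4, stage5 are an assumption (check bottom_up.output_shape() for the real values).

```python
import math


def fpn_output_strides(in_strides, num_top_levels=2):
    """Map bottom-up strides to FPN level names using the p{log2(stride)} convention.

    num_top_levels=2 models a P6/P7 top block that adds two coarser levels.
    """
    out = {f"p{int(math.log2(s))}": s for s in in_strides}
    last_level = int(math.log2(in_strides[-1]))
    for level in range(last_level + 1, last_level + 1 + num_top_levels):
        out[f"p{level}"] = 2 ** level
    return out


# Assumed strides for stage3, stage4, stage5 (hypothetical values).
levels = fpn_output_strides([8, 16, 32])
print(levels)
```

If the printed level names differ from the head's input feature names, the mismatch would be a configuration problem separate from the NaN loss.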

cxx921656591 commented 4 years ago

Nice copy LOL. By the way, I think it's because your learning rate is too high. Try lowering it by a factor of 10 to 100, and don't forget to increase the number of training iterations accordingly.
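
As a rough sketch of that suggestion: the helper below divides the learning rate by a factor and stretches the schedule by the same factor. The proportional stretch is an assumption (the advice only says to train longer), the function name scale_schedule is hypothetical, and base_lr=0.01 is an assumed starting value; STEPS and MAX_ITER are the ones from the config above.

```python
def scale_schedule(base_lr, steps, max_iter, factor):
    """Divide the learning rate by `factor` and lengthen the schedule by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return {
        "BASE_LR": base_lr / factor,
        "STEPS": tuple(int(s * factor) for s in steps),
        "MAX_ITER": int(max_iter * factor),
    }


# base_lr=0.01 is an assumption; steps and max_iter come from the posted config.
solver = scale_schedule(base_lr=0.01, steps=(210000, 250000), max_iter=270000, factor=10)
print(solver)
```

The resulting values would go under SOLVER in the YAML; whether a full proportional stretch is affordable depends on the training budget.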

y200504040u commented 4 years ago

Cut-and-pasted 😂... I tried a lower learning rate: the loss no longer explodes, but it doesn't decrease either. I read the VoVNet paper, and in the experiments the authors didn't use VoVNet as the backbone of any object detection network except RefineDet.

Cyril9227 commented 4 years ago

Same error: I can't manage to fit a vovnet-lite-dw or a vovnet-19-dw, and I keep getting NaN loss. Vovnet-lite is fine though, so I have the feeling that something is wrong with the depthwise convolution.

lsrock1 commented 4 years ago

When I tested this kind of lightweight backbone in object detection (e.g., MobileNet, ShuffleNet, etc.), I used a longer warm-up period (more warm-up iterations).
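
To illustrate why a longer warm-up can help, here is a minimal plain-Python sketch of a linear warm-up rule. The function name warmup_lr and base_lr=0.01 are assumptions for illustration; in detectron2 the warm-up length is normally set through SOLVER.WARMUP_ITERS (with SOLVER.WARMUP_FACTOR as the starting fraction) rather than by hand.

```python
def warmup_lr(base_lr, it, warmup_iters, warmup_factor=0.001):
    """Linear warm-up: ramp from base_lr * warmup_factor to base_lr over warmup_iters."""
    if it >= warmup_iters:
        return base_lr
    alpha = it / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)


base_lr = 0.01  # assumed value, not taken from the thread
for warmup_iters in (1000, 5000):
    lrs = [warmup_lr(base_lr, it, warmup_iters) for it in (0, 500, 1000)]
    print(warmup_iters, [round(lr, 6) for lr in lrs])
```

With the longer warm-up, the learning rate at iteration 500 is several times smaller, which keeps early updates gentler during the window in which the NaN was reported.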

youngwanLEE / vovnet-detectron2

Loss NaN about using vovnet as backbone in RetinaNet #8