youngwanLEE / CenterMask

[CVPR 2020] CenterMask : Real-Time Anchor-Free Instance Segmentation
https://arxiv.org/abs/1911.06667

Training Problem #1

Closed trungpham2606 closed 4 years ago

trungpham2606 commented 4 years ago

@youngwanLEE Thank you for sharing such great work. I am using your CenterMask on my dataset, but the losses look very strange. I have captured them here:

[screenshot: loss values]

So I switched to the COCO dataset, and the losses are nearly the same. It looks very strange. Can you help me figure out what the problem is?

youngwanLEE commented 4 years ago

@trungpham2606 Hi, this happens while the iteration count is still within the warmup period.

After the warmup iterations, loss_mask and loss_maskiou converge normally.

Please train longer; if you face this problem again, re-open this issue.
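For context, a minimal sketch of how a constant-warmup multi-step schedule behaves, modeled on the maskrcnn-benchmark-style scheduler this repo builds on; the function name and defaults are illustrative, not the exact implementation:

```python
# Sketch of a constant-warmup multi-step LR schedule (illustrative, not the
# repo's exact scheduler). During warmup the base LR is scaled down by a
# constant factor; afterwards it decays by `gamma` at each milestone.
from bisect import bisect_right

def lr_at_iter(iteration, base_lr=0.01, warmup_iters=500,
               warmup_factor=1.0 / 3, steps=(120000, 160000), gamma=0.1):
    factor = warmup_factor if iteration < warmup_iters else 1.0
    return base_lr * factor * gamma ** bisect_right(steps, iteration)

print(lr_at_iter(100))     # ~0.00333: still inside warmup
print(lr_at_iter(1000))    # 0.01: full base LR after warmup
print(lr_at_iter(130000))  # 0.001: decayed once past the first milestone
```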

trungpham2606 commented 4 years ago

@youngwanLEE I set warmup_iters = 20. After 20 iterations, the losses are still the same. Does that value affect the training process?

youngwanLEE commented 4 years ago

@trungpham2606 Yes, I recommend a longer warmup_iters.
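A hedged example of lengthening the warmup in the solver section of the config; the values below are illustrative, and the codebase defaults may differ:

```yaml
SOLVER:
  WARMUP_METHOD: "constant"  # or "linear"
  WARMUP_FACTOR: 0.3333      # LR multiplier applied during warmup
  WARMUP_ITERS: 1000         # well above 20; 500-1000 is a common range
```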

trungpham2606 commented 4 years ago

@youngwanLEE OK, let me check it again.

trungpham2606 commented 4 years ago

@youngwanLEE The losses are unchanged after 700 iters.

youngwanLEE commented 4 years ago

Could you show me your config file?

trungpham2606 commented 4 years ago

@youngwanLEE here it is:

```yaml
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  BACKBONE:
    CONV_BODY: "R-50-FPN-RETINANET"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
  RPN_ONLY: True
  FCOS_ON: True
  FCOS_MASK: True
  RETINANET:
    USE_C5: False  # FCOS uses P5 instead of C5
  MASK_ON: True
  MASKIOU_ON: True
  FCOS:
    CENTER_SAMPLE: True
    POS_RADIUS: 1.5
    LOC_LOSS_TYPE: "giou"
    INFERENCE_TH: 0.03
  ROI_HEADS:
    USE_FPN: True
  ROI_MASK_HEAD:
    POOLER_SCALES: (0.125, 0.0625, 0.03125)  # 1/8, 1/16, 1/32
    FEATURE_EXTRACTOR: "MaskRCNNFPNSpatialAttentionFeatureExtractor"
    LEVEL_MAP_FUNCTION: "CenterMaskLevelMapFunc"
    PREDICTOR: "MaskRCNNC4Predictor"
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 2
    RESOLUTION: 28
    SHARE_BOX_FEATURE_EXTRACTOR: False
DATASETS:
  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
  TEST: ("coco_2014_minival",)
INPUT:
  MIN_SIZE_RANGE_TRAIN: (640, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.01
  WEIGHT_DECAY: 0.0001
  STEPS: (120000, 160000)
  MAX_ITER: 180000
  IMS_PER_BATCH: 16
  WARMUP_METHOD: "constant"
OUTPUT_DIR: 'checkpoints/CenterMask-R-50-FPN-Scoring-MS-2x'
```

youngwanLEE commented 4 years ago

@trungpham2606 ,

When I tried the config, the logs look as shown below:

[screenshot: training log]

On my server, training proceeds normally.

I cannot figure out why this problem happens on your side.

trungpham2606 commented 4 years ago

@youngwanLEE So strange. OK, I will dig into it carefully. Thanks for your prompt support.

stigma0617 commented 4 years ago

@trungpham2606

How many GPUs do you use?

In my case, I use 8 Titan Xp GPUs.

trungpham2606 commented 4 years ago

Only 1.

stigma0617 commented 4 years ago

@trungpham2606

Can a single GPU handle a batch size of 16?

trungpham2606 commented 4 years ago

@stigma0617 Oops, I sent the wrong config. But I changed nothing except IMS_PER_BATCH, which I set to 1.

stigma0617 commented 4 years ago

@trungpham2606

I guess the problem is caused by the batch size of 1.

Batch normalization needs at least 2 images per batch.
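For illustration, a minimal sketch (not from this repo) of one common workaround when a batch size of 1 is unavoidable: freeze the BatchNorm layers so they rely on running statistics instead of degenerate single-image batch statistics.

```python
# Freeze all BatchNorm2d layers: use running statistics and stop updating
# the affine parameters. Re-apply after any later call to model.train(),
# since that call would flip the layers back into training mode.
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()
            for p in module.parameters():
                p.requires_grad = False
    return model
```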

trungpham2606 commented 4 years ago

@stigma0617 Yeah, maybe that's the problem. I will try it tomorrow with another GPU.

trungpham2606 commented 4 years ago

@stigma0617 Can you change the batch size to 1 to see what happens? When I used the smaller net (centermask_R_50_FPN_lite_res600_ms_bs32_1x.yaml), I could train with a batch size of 4, but the losses were still nearly unchanged.
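A side note on small batches: by the linear scaling rule (Goyal et al.), the base LR should shrink in proportion to the batch size. A hedged sketch against the posted solver settings, where 0.01 corresponds to a batch of 16; the numbers are illustrative:

```yaml
SOLVER:
  IMS_PER_BATCH: 4    # reduced from the reference 16
  BASE_LR: 0.0025     # 0.01 * 4/16, per the linear scaling rule
  WARMUP_ITERS: 2000  # a longer warmup often helps at small batch sizes
```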

trungpham2606 commented 4 years ago

@stigma0617 @youngwanLEE I think the problem comes from the warmup iterations. If I follow your training schedule (loading only the ImageNet-pretrained ResNet weights), I guess 500 warmup iterations are not enough, and in my case warmup is unnecessary. So now I am loading your pretrained weights (for example, centermask-R-101-ms-2x.pth) and the losses are reasonable:

[screenshot: loss values]

Thank you guys for spending time on my issue.
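A minimal sketch of the corresponding config change, assuming the released checkpoint has been downloaded locally; the path below is illustrative:

```yaml
MODEL:
  # Initialize from a released CenterMask checkpoint instead of only the
  # ImageNet-pretrained backbone; adjust the path to your local download.
  WEIGHT: "weights/centermask-R-101-ms-2x.pth"
```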

shoutOutYangJie commented 4 years ago

> So now I am loading your pretrained weights (for example, centermask-R-101-ms-2x.pth) and the losses are reasonable.

But your loss is still large.

shoutOutYangJie commented 4 years ago

> In my case, I use 8 Titan Xp GPUs.

So can you train CenterMask normally with the author's config?

trungpham2606 commented 4 years ago

@shoutOutYangJie Not too big, right? :)))) I don't remember exactly which dataset I was using at that moment, but I gave up on this work ^^. Can you train this network normally? Feel free to send me some results of your training; I am looking forward to them.

shoutOutYangJie commented 4 years ago

@stigma0617 Can you train CenterMask normally?

zimenglan-sysu-512 commented 4 years ago

> Can you change the batch size to 1 to see what happens?

I trained centermask_V_39_eSE_FPN_lite_res600_ms_bs16_4x.yaml and it encounters NaN losses. How can I fix it?

youngwanLEE commented 4 years ago

@zimenglan-sysu-512

How about lowering base_lr?
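A hedged example of that change; the values are illustrative. Lowering the base LR, and optionally lengthening the warmup, are the usual first steps against NaN losses:

```yaml
SOLVER:
  BASE_LR: 0.005      # halved from 0.01; halve again if NaNs persist
  WARMUP_ITERS: 1000  # a longer warmup also stabilizes early training
```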