yjh0410 / CenterNet-plus

A Simple Baseline for Object Detection

txty_loss not decreasing and mAP returns the value 0.0 #12

Open YashRunwal opened 3 years ago

YashRunwal commented 3 years ago

@yjh0410 ,

As you know, I am training the model with grayscale images (512, 1536) with a few augmentation techniques.

I have trained the model (pretrained ResNet-18) for about 20 epochs and the txty_loss is not decreasing at all; it stays in the range 20-25. So after 20 epochs, I evaluated the model on the validation dataset. The mAP is 0.0.

Why is this? Does it need to be trained for a longer period of time?

Appreciate your help.

yjh0410 commented 3 years ago

As far as I know, the txty loss drops very slowly, which has troubled me too. But I don't think that is the reason your mAP is zero.

Before evaluating, I suggest running test.py to visualize your predictions and check the model's output (you may need to change some code in test.py to fit your own dataset).

By the way, I have found that the IoU-aware prediction impairs performance because it lowers the scores. I sincerely suggest removing the iou-aware branch in your task, or you can square the score ( score = cls_pred.sigmoid() * iou_aware_pred.sigmoid() ) just as FCOSv2 does. Of course, you could run some ablation studies on your dataset to decide whether you need it.
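
As a rough illustration (not the repo's actual code), the two options could look like the hypothetical helper below, assuming the tensor names cls_pred and iou_aware_pred from this thread:

    import torch

    def detection_score(cls_pred, iou_aware_pred=None):
        # Hypothetical helper, not part of CenterNet-plus.
        if iou_aware_pred is None:
            # Option 1: remove the iou-aware branch and score with the class head alone.
            return torch.sigmoid(cls_pred)
        # Option 2: keep the branch and re-weight the class score by the IoU-aware score.
        score = torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred)
        # How to re-map this product (e.g. squaring it, as mentioned above) is a
        # design choice worth ablating on your own dataset.
        return score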

YashRunwal commented 3 years ago

@yjh0410 Where do you propose this score variable should go in the model? Should I replace the iou_aware_loss with the score? I don't quite get it.

I would like to train both with iou_aware_pred and without it, then record the mAP, and I will let you know the results.

Edit:

In the forward function of the model, there already is a score, as can be seen below:

    else:
        with torch.no_grad():
            # Class prediction
            cls_pred = torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred)

Is this what you mean? Or are you saying that this should also be in the trainable part of the code?

YashRunwal commented 3 years ago

@yjh0410 The txty loss after 26 epochs:

[screenshot: txty loss log after 26 epochs]

Interesting. What do you think? My plan is to train the model for around 100 epochs. I will first check the model results after 25, 40, and 50 epochs, as I have saved the checkpoints. But something needs to be done about this txty_loss, right?

yjh0410 commented 3 years ago

@YashRunwal What I mean is that you can change the code cls_pred = torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred) to cls_pred = torch.sigmoid(torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred)).

As you can see, the txty loss drops very slowly, but that is really OK.

YashRunwal commented 3 years ago

@yjh0410 I will try it out. The model is training at the moment; soon after that, I will try it. Also, I would like to discuss something with you that involves developing the architecture further, using two types of images. Could you share your email address if possible? I can also open a separate issue on this later if you'd like.

YashRunwal commented 3 years ago

@yjh0410 Regarding the mAP and the txty loss: I trained (ResNet-18 backbone) for around 150 epochs, but the txty loss is not dropping. It is still in the range of 10-15. So I evaluated on the validation dataset and got an mAP of 0.001.

To be honest, I don't understand why. Can you help out?

YashRunwal commented 3 years ago

@yjh0410 I am using batch_size=1 because of my GPU capacity and training data size. I am using an SGD optimizer with a learning rate of 1e-3 and a warmup strategy. However, the txty loss doesn't converge.

Do you think it is better to use the Adam optimizer with the same learning rate, or to increase the learning_rate to 0.02 when training with SGD? I would like to train for at most 70 epochs, again due to my GPU.
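
For reference, here is a minimal sketch of SGD with a linear warmup in PyTorch; the warmup length and the placeholder model are assumptions, not values from this repo:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # placeholder for the real model
    base_lr = 1e-3
    warmup_iters = 1000  # assumed warmup length; tune to your dataset size

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)

    # Linear warmup: ramp the LR from ~0 up to base_lr over the first warmup_iters
    # iterations, then keep it at base_lr (a decay schedule can be chained afterwards).
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))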

yjh0410 commented 3 years ago

@YashRunwal When you use batch_size=1, there is a big problem with BN, which is sensitive to batch size. Maybe you should try another normalization layer, such as Instance Normalization (IN).

I also suggest resizing your input images to a smaller size; the large input size is not necessary. In my project I just use 512x512, not the original size of the input image.

YashRunwal commented 3 years ago

@yjh0410 I cannot resize the images; that's a constraint. However, I want to ask whether I can use the pretrained ResNet model with Instance Normalization.

yjh0410 commented 3 years ago

@YashRunwal Maybe you can freeze the BN layers in the pretrained ResNet model. For the other BN layers, you can try Group Normalization.
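
A rough sketch of what this could look like in PyTorch (the helper and the head module below are illustrative placeholders, not the actual CenterNet-plus code):

    import torch.nn as nn
    import torchvision

    def freeze_bn(module):
        # Keep the pretrained running statistics and stop updating the affine
        # parameters, so batch_size=1 no longer corrupts the BN statistics.
        for m in module.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
                m.weight.requires_grad_(False)
                m.bias.requires_grad_(False)

    backbone = torchvision.models.resnet18(pretrained=True)
    freeze_bn(backbone)  # re-apply this after every call to backbone.train()

    # Newly added layers can use GroupNorm instead, which does not depend on batch size.
    head_conv = nn.Sequential(
        nn.Conv2d(256, 256, kernel_size=3, padding=1),
        nn.GroupNorm(num_groups=32, num_channels=256),
        nn.ReLU(inplace=True),
    )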

YashRunwal commented 3 years ago

@yjh0410 Yep, I will do it today and let you know the results. I will use groups=32 for GN. But wouldn't the loss increase from freezing the BN layers in the backbone?

YashRunwal commented 3 years ago

@yjh0410 Even after freezing the BN layers in the backbone (ResNet-18), the loss is not decreasing. I have tried the following strategies:

  1. SGD with lr=1e-3 and 80 Epochs
  2. Adam with lr=1e-4 and 80 Epochs
  3. Adam with lr=3e-4 and 80 Epochs

But the loss decreases only slightly and then remains constant. After a few epochs, the validation AP also stops improving.

YashRunwal commented 3 years ago

@yjh0410 Would using a different loss function for txty_loss solve this issue? If so, what could we change?

yjh0410 commented 3 years ago

@YashRunwal Maybe you can use the gradient accumulation method, referring to YOLOv5, to alleviate the problem of having only batch size 1 in your task.
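
A rough sketch of gradient accumulation with batch_size=1 (the function and its arguments are illustrative placeholders; YOLOv5's own implementation differs in detail):

    def train_with_accumulation(model, dataloader, optimizer, criterion, accumulate=16):
        # Accumulate gradients over `accumulate` batch_size=1 iterations before each
        # optimizer step, giving an effective batch size of `accumulate`.
        model.train()
        optimizer.zero_grad()
        for it, (images, targets) in enumerate(dataloader):
            loss = criterion(model(images), targets)
            # Scale so the accumulated gradient averages over the virtual batch.
            (loss / accumulate).backward()
            if (it + 1) % accumulate == 0:
                optimizer.step()
                optimizer.zero_grad()

Note that accumulation only affects the gradients; BatchNorm still sees one image per forward pass, so it is best combined with frozen BN or batch-size-independent layers such as GroupNorm.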

YashRunwal commented 3 years ago

@yjh0410 Yes, I tried using gradient accumulation with BatchNorm layers. I don't think we can use GN layers with gradient accumulation. I tried 32 and 64 groups, but the result is still the same. I am training with 2700 images. Could this be the reason? Do I need more data? I mean, after a few epochs the validation accuracy starts decreasing.