YashRunwal opened this issue 3 years ago
As far as I know, the txty loss drops very slowly, which bothers me too. But I don't think it is the reason why your mAP is zero.
Before evaluating, I suggest running test.py to visualize your predictions and check the model's output (you may need to change some code in test.py to fit your own dataset).
By the way, I find that the IoU-aware prediction impairs performance because it makes the scores lower. I sincerely suggest removing the iou-aware branch for your task, or you can square the score (score = cls_pred.sigmoid() * iou_aware_pred.sigmoid()) just as FCOSv2 does. Of course, you could do some ablation studies on your dataset to determine whether you need it.
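For reference, a minimal sketch of the two options, assuming raw (pre-sigmoid) head outputs named cls_pred and iou_aware_pred; the actual tensor names and shapes in the repo may differ, and the FCOS-style square-root rescaling shown here is only one common way to keep the fused score from being pushed too low:

```python
import torch

# Hypothetical raw head outputs (logits): [batch, anchors, classes] and [batch, anchors, 1]
cls_pred = torch.randn(2, 100, 20)
iou_aware_pred = torch.randn(2, 100, 1)

# Option 1: drop the IoU-aware branch and score with the class head alone
score_cls_only = cls_pred.sigmoid()

# Option 2: keep the branch but rescale the fused score (FCOS-style square root)
# so multiplying two probabilities does not double-penalize the final score
score_fused = (cls_pred.sigmoid() * iou_aware_pred.sigmoid()).sqrt()
```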
@yjh0410
Where do you propose this score variable should go in the model? Should I replace the iou_aware_loss with the score? I don't quite get it.
I would like to train once with iou_aware_pred and once without, then record the mAP and let you know.
Edit:
In the forward function of the model, there already is a score, as can be seen below:

```python
else:
    with torch.no_grad():
        # Class prediction
        cls_pred = torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred)
```

Is this what you mean? Or do you want to say that this should also be in the trainable part of the code?
@yjh0410 The txty loss after 26 epochs:
Interesting. What do you think? My plan is to train the model for around 100 epochs. I will first check the results after 25, 40, and 50 epochs, as I have saved those checkpoints. But something needs to be done about this txty_loss, right?
@YashRunwal
What I mean is that you can change cls_pred = torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred) to cls_pred = torch.sigmoid(torch.sigmoid(cls_pred) * torch.sigmoid(iou_aware_pred)).
As you can see, the txty loss drops very slowly, but it is really OK.
@yjh0410 I will try it out as soon as the current training run finishes. Also, I would like to discuss something with you about developing the architecture further, using two types of images. Could you provide your email address if possible? I can also open a separate issue on this later if you'd like.
@yjh0410 Regarding the mAP and the txty loss: I trained (ResNet-18 backbone) for around 150 epochs, but the txty loss is not dropping; it is still in the range of 10-15. So I evaluated on the validation dataset and got an mAP of 0.001.
I honestly don't understand why. Can you help out?
@yjh0410 I am using batch_size=1 because of my GPU capacity and training-data size. I am using the SGD optimizer with a learning rate of 1e-3 and a warmup strategy. However, the txty loss doesn't converge.
Do you think it is better to use the Adam optimizer with the same learning rate, or to increase the learning_rate to 0.02 when training with SGD? I would like to train for at most 70 epochs, again due to my GPU.
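For context, a minimal sketch of the setup described above (SGD at 1e-3 with a linear warmup); the model, warmup length, and hyperparameters here are placeholders, not the repo's actual training script:

```python
import torch

model = torch.nn.Linear(10, 2)   # placeholder for the detector
base_lr = 1e-3
warmup_iters = 500               # hypothetical warmup length

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=5e-4)

def set_warmup_lr(iteration):
    # Linearly ramp the learning rate from ~0 up to base_lr over warmup_iters steps
    if iteration < warmup_iters:
        lr = base_lr * (iteration + 1) / warmup_iters
        for group in optimizer.param_groups:
            group['lr'] = lr
```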
@YashRunwal When you use batch_size=1, there is a big problem with BN, which is sensitive to batch size. Maybe you should try another normalization layer, such as Instance Normalization (IN).
I also suggest resizing your input image to a smaller size; it is not necessary to use a big input size. In my project, I just use 512x512, not the original size of the input image.
@yjh0410 I cannot resize the image; that's a constraint. However, I want to ask whether I can still use the pretrained ResNet model with Instance Normalization?
@YashRunwal Maybe you can freeze the BN layers in the pretrained ResNet model. As for the other BN layers, you can try Group Normalization.
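A minimal sketch of what freezing the backbone's BN layers could look like, assuming a torchvision ResNet-18; the repo's backbone wrapper may expose this differently, and the GroupNorm line is only an illustration for newly added layers:

```python
import torch
import torchvision

backbone = torchvision.models.resnet18(pretrained=True)

def freeze_bn(module):
    # Put every BatchNorm layer in eval mode (so it keeps the pretrained running
    # statistics, which cannot be reliably updated with batch_size=1) and stop
    # updating its affine parameters.
    for m in module.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.eval()
            m.weight.requires_grad_(False)
            m.bias.requires_grad_(False)

freeze_bn(backbone)
# Note: model.train() flips BN back to training mode, so freeze_bn needs to be
# reapplied (or train() overridden) after each call to train().

# For layers outside the pretrained backbone, GroupNorm does not depend on batch size:
gn = torch.nn.GroupNorm(num_groups=32, num_channels=256)
```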
@yjh0410 Yep, I will do that today and let you know the results. I will use groups=32 for GN. But wouldn't the loss increase from freezing the BN layers in the backbone?
@yjh0410 Even after freezing the BN layers in the backbone (ResNet-18), the loss is not decreasing. I have tried the following strategies:
But the loss decreases only slightly and then remains constant. After a few epochs, the validation AP also stops improving.
@yjh0410 Would using a different loss function for txty_loss solve this issue? If so, what can we change?
@YashRunwal Maybe you can use the gradient accumulation method, as in YOLOv5, to alleviate the problem of having only batch size 1 in your task.
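A minimal sketch of gradient accumulation that builds an effective batch of 16 out of batch_size=1 steps; the model, data, and loss below are placeholders rather than the repo's or YOLOv5's actual code:

```python
import torch

# Placeholders standing in for the detector, data loader, and detection loss
model = torch.nn.Linear(10, 2)
data = [(torch.randn(1, 10), torch.randint(0, 2, (1,))) for _ in range(64)]
criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
accumulate = 16  # hypothetical number of batch_size=1 steps per optimizer update

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y)
    (loss / accumulate).backward()   # scale so the accumulated gradient is an average
    if (step + 1) % accumulate == 0:
        optimizer.step()             # one weight update every `accumulate` samples
        optimizer.zero_grad()
```

Note that this only smooths the gradient; it does not fix BN statistics, since each forward pass still sees a single image.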
@yjh0410 Yes, I tried using gradient accumulation with the BatchNorm layers. I don't think we can use GN layers with gradient accumulation; I tried 32 and 64 groups, but the result is still the same. I am training on 2,700 images. Could this be the reason? Do I need more data? I mean, after a few epochs the validation accuracy starts decreasing.
@yjh0410,
As you know, I am training the model on grayscale images (512, 1536) with a few augmentation techniques.
I have trained the model (pretrained ResNet-18) for about 20 epochs and the txty_loss is not decreasing at all; it stays in the range of 20-25. So after 20 epochs, I evaluated the model on the validation dataset, and the mAP is 0.0.
Why is this? Does it need to be trained for a longer period of time?
Appreciate your help.