Open ck-amrahd opened 2 months ago
The same model works for both image sizes. My GPU memory is not enough even for 800 x 1333 images. Did you guys get the same accuracy with both input image sizes (800, 1333) and (1200, 2000)?
Is there any way to reduce memory? I think the memory issue comes from the O(n^2) memory of the transformer; can we replace that with an O(n) version or something like xformers?
Now that I have gone into the details, I see it is using deformable attention, meaning we don't need an O(n) version of the transformer. Please correct me if I am wrong. Maybe we can fine-tune it using LoRA? I don't know how straightforward that process is in this model. Could you suggest some ideas for fine-tuning your biggest model on a single RTX 3090 Ti machine (24 GB VRAM)?
Hi, @ck-amrahd Thanks for your question.
Actually, we haven't pre-trained relation_detr_focalnet_large_lrf_fl4_800_1333 on COCO. To fine-tune on a custom dataset, you can directly load the 1200_2000 weights into the 800_1333 model, as image_size does not change the model architecture.
The model accuracy with image_size (800, 1333) should be a little lower than with (1200, 2000), but it will use much less memory.
The memory cost mainly comes from the backbone and the transformer encoder. Here is some advice to reduce memory when fine-tuning:
For backbone:
Freeze more layers: Each backbone has 4 stages indexed by (0, 1, 2, 3); by default we freeze no stages. You can freeze more stages with freeze_indices to save memory, for example freeze_indices=(0,) or freeze_indices=(0, 1).
Use fewer feature maps for the transformer: please set return_indices=(1, 2, 3) for focalnet_large; it will reduce the number of tokens for the transformer encoder by about 50% and only reduce accuracy by about 1~2 AP on COCO.
Here is a backbone setting with a better trade-off between GPU memory and accuracy.
backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(1, 2, 3), freeze_indices=(0,))
For transformer encoder:
Yes, we don't need O(n) attention since deformable attention is already O(n): each query attends to a small fixed number of sampled points per feature level instead of to all tokens.
LoRA mainly solves the memory problem caused by large parameter counts by decomposing W into low-rank matrices. But the memory of Relation-DETR mainly comes from the intermediate activations of the model, not the model parameters, so our model may not need LoRA. If you want to try it, you can wrap the following linear layers in the MultiScaleDeformableAttention of the transformer encoder with LoRA:
self.attention_weights = nn.Linear(embed_dim, num_heads * num_levels * num_points)
self.value_proj = nn.Linear(embed_dim, embed_dim)
self.output_proj = nn.Linear(embed_dim, embed_dim)
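For reference, here is a minimal sketch of what that wrapping could look like. It assumes a model built from one of the repo's configs; the LoRALinear class, the rank/alpha values, and the module path model.transformer.encoder.layers / layer.self_attn are illustrative assumptions, not part of the repo:
import torch
from torch import nn

class LoRALinear(nn.Module):
    # hypothetical low-rank adapter around a frozen nn.Linear (sketch only)
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the original projection
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling

# wrap the three projections inside each deformable attention module of the encoder;
# the attribute names below (encoder.layers, self_attn) are assumed and may differ
for layer in model.transformer.encoder.layers:
    attn = layer.self_attn
    attn.attention_weights = LoRALinear(attn.attention_weights)
    attn.value_proj = LoRALinear(attn.value_proj)
    attn.output_proj = LoRALinear(attn.output_proj)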
Thank you @xiuqhou, I will give it a try.
Hi @xiuqhou Thanks for the idea. I am able to fine-tune the larger model with:
backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(2, 3), freeze_indices=(0, 1))
I also reduced the number of queries and the number of hybrid proposals and now I am able to fine-tune the version that takes 1200 * 2000 images. However, the model fine-tuned with this setting performed poorly compared to a model that I fine-tuned from the Swin-L backbone. I am not able to figure out why. I will keep looking into it. In the meantime, do you have any intuition over why that may be the case?
Hi, @ck-amrahd Thanks for your feedback. If you can fine-tune the Swin-L backbone, the FocalNet-Large backbone should also work under the same settings, because they have similar memory cost. Therefore, I suggest that you keep the backbone settings consistent with Swin-L, including:
Please use return_indices=(1, 2, 3) instead of return_indices=(2, 3). Using fewer indices will seriously affect performance and will not reduce memory cost by much: from index 0 to index 3, the memory cost of each index is only about 1/4 of the previous one, so return_indices=(1, 2, 3) is totally enough.
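As a rough back-of-the-envelope check of that 1/4 ratio (assuming the usual strides of 4, 8, 16 and 32 for stages 0-3; numbers are approximate):
# approximate token counts per backbone stage for an 800 x 1333 input
h, w = 800, 1333
for stage, stride in enumerate([4, 8, 16, 32]):
    print(f"stage {stage}: ~{(h // stride) * (w // stride)} tokens")
# stage 0: ~66600 tokens
# stage 1: ~16600 tokens
# stage 2: ~4150 tokens
# stage 3: ~1025 tokens
So dropping index 0 already removes the bulk of the tokens, while additionally dropping index 1 only saves a comparatively small amount on top of that.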
Did you change min_size and max_size but leave train_transform in train_config.py unchanged? They should be changed correspondingly: the strong_album transform should be used with the 800 * 1333 version and strong_album_1200_2000 with the 1200 * 2000 version. You can also define your own data augmentation for other image_size values; a rough sketch follows below. min_size and max_size affect the image_size for inference, while train_transform affects the image_size for training.
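For illustration only, a custom augmentation for another image_size could follow the same resize-then-augment idea. The pipeline below is a rough albumentations-style sketch; the names and sizes in it are hypothetical, and the real presets in this repo may use a different format:
import albumentations as A

# hypothetical pipeline for a custom max size of 1000 x 1666 (sketch only;
# mirror whatever strong_album / strong_album_1200_2000 actually do in the repo)
custom_album_1000_1666 = A.Compose(
    [
        A.LongestMaxSize(max_size=1666),        # cap the longer image side
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["labels"]),
)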
On the other hand, I strongly suggest using the 800 * 1333 version, since larger image_sizes have diminishing returns but greatly increase memory cost. And please do not reduce or increase num_queries or hybrid_proposals, as they have a large effect on final performance.
To make it simple, I put my own modified 800 * 1333 model_configs for Focal-Large here. I have successfully run it on my own 3090 GPU with bf16 and the strong_album train_transform. It should have better performance than Swin-L. You can load the checkpoint (https://github.com/xiuqhou/Relation-DETR/releases/download/v1.0.0/relation_detr_focalnet_large_lrf_fl4_o365_4e-coco_2x_1200_2000.pth) directly and fine-tune it; a minimal loading sketch follows the config below.
from torch import nn
from models.backbones.focalnet import FocalNetBackbone
from models.bricks.position_encoding import PositionEmbeddingSine
from models.bricks.post_process import PostProcess
from models.bricks.relation_transformer import (
    RelationTransformer,
    RelationTransformerDecoder,
    RelationTransformerEncoder,
    RelationTransformerEncoderLayer,
    RelationTransformerDecoderLayer,
)
from models.bricks.set_criterion import HybridSetCriterion
from models.detectors.relation_detr import RelationDETR
from models.matcher.hungarian_matcher import HungarianMatcher
from models.necks.channel_mapper import ChannelMapper
# mostly changed parameters
embed_dim = 256
num_classes = 91
num_queries = 900
hybrid_num_proposals = 1500
hybrid_assign = 6
num_feature_levels = 5
transformer_enc_layers = 6
transformer_dec_layers = 6
num_heads = 8
dim_feedforward = 2048
# instantiate model components
position_embedding = PositionEmbeddingSine(
    embed_dim // 2, temperature=10000, normalize=True, offset=-0.5
)
backbone = FocalNetBackbone("focalnet_large_lrf_fl4", weights=False, return_indices=(1, 2, 3), freeze_indices=(0,))
neck = ChannelMapper(backbone.num_channels, out_channels=embed_dim, num_outs=num_feature_levels)
transformer = RelationTransformer(
    encoder=RelationTransformerEncoder(
        encoder_layer=RelationTransformerEncoderLayer(
            embed_dim=embed_dim,
            n_heads=num_heads,
            dropout=0.0,
            activation=nn.ReLU(inplace=True),
            n_levels=num_feature_levels,
            n_points=4,
            d_ffn=dim_feedforward,
        ),
        num_layers=transformer_enc_layers,
    ),
    decoder=RelationTransformerDecoder(
        decoder_layer=RelationTransformerDecoderLayer(
            embed_dim=embed_dim,
            n_heads=num_heads,
            dropout=0.0,
            activation=nn.ReLU(inplace=True),
            n_levels=num_feature_levels,
            n_points=4,
            d_ffn=dim_feedforward,
        ),
        num_layers=transformer_dec_layers,
        num_classes=num_classes,
    ),
    num_classes=num_classes,
    num_feature_levels=num_feature_levels,
    two_stage_num_proposals=num_queries,
    hybrid_num_proposals=hybrid_num_proposals,
)
matcher = HungarianMatcher(
    cost_class=2, cost_bbox=5, cost_giou=2, focal_alpha=0.25, focal_gamma=2.0
)
# construct weight_dict for loss
weight_dict = {"loss_class": 1, "loss_bbox": 5, "loss_giou": 2}
weight_dict.update({"loss_class_dn": 1, "loss_bbox_dn": 5, "loss_giou_dn": 2})
aux_weight_dict = {}
for i in range(transformer.decoder.num_layers - 1):
    aux_weight_dict.update({k + f"_{i}": v for k, v in weight_dict.items()})
weight_dict.update(aux_weight_dict)
weight_dict.update({"loss_class_enc": 1, "loss_bbox_enc": 5, "loss_giou_enc": 2})
weight_dict.update({k + "_hybrid": v for k, v in weight_dict.items()})
criterion = HybridSetCriterion(
    num_classes=num_classes, matcher=matcher, weight_dict=weight_dict, alpha=0.25, gamma=2.0
)
postprocessor = PostProcess(select_box_nums_for_evaluation=300)
# combine above components to instantiate the model
model = RelationDETR(
    backbone=backbone,
    neck=neck,
    position_embedding=position_embedding,
    transformer=transformer,
    criterion=criterion,
    postprocessor=postprocessor,
    num_classes=num_classes,
    num_queries=num_queries,
    hybrid_assign=hybrid_assign,
    denoising_nums=100,
    min_size=800,
    max_size=1333,
)
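And here is a minimal loading sketch for the released checkpoint mentioned above. Whether the weights sit under a "model" key is an assumption; adjust to the actual file layout:
import torch

checkpoint = torch.load(
    "relation_detr_focalnet_large_lrf_fl4_o365_4e-coco_2x_1200_2000.pth",
    map_location="cpu",
)
# the weights may be stored under a "model" key (assumption); fall back to the raw dict
state_dict = checkpoint.get("model", checkpoint)
# strict=False tolerates keys that differ between the 1200_2000 and 800_1333 setups
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")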
Hi @xiuqhou, Thank you so much for the detailed feedback. I will try it and let you know.
Hi @xiuqhou Thank you for the feedback. I am now fine-tuning the large FocalNet model on a custom dataset following your instructions. For now I am only training for 1-2 epochs due to the computational cost. The total loss starts around 60 and goes to around 40 at the end of the first epoch. I will train for more epochs, but this loss seems quite high for a detection task. Do you have any intuition for this? Is that what you observe when fine-tuning on some datasets?
Hi @ck-amrahd The total loss for the first epoch looks OK. Our method has an extra branch compared to DETRs like DINO, so it contains more loss terms and a larger total loss. When I trained Relation-DETR on COCO, the loss also started around 60 and went to around 35 at the end of the first epoch. That is similar to your result, and the difference may come from the different sizes of the datasets. As long as the loss goes down steadily, the training process should be OK. You can refer to our released training log for details.
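For a rough sense of scale, the weight_dict built in the config above ends up with roughly twice as many weighted loss terms as a single-branch DETR setup, because every term also gets a _hybrid copy. A quick count using the values from that config:
# count the weighted loss terms produced by the config above
base = 3 + 3                               # class/bbox/giou plus their _dn copies
aux = base * (transformer_dec_layers - 1)  # copies for each intermediate decoder layer
enc = 3                                    # encoder (two-stage) terms
print((base + aux + enc) * 2)              # 78 terms after the _hybrid duplication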
Question
Hi, Thanks for the awesome repo. I am trying to fine-tune your model on a custom dataset. My GPU memory is not enough to fine-tune the relation_detr_focalnet_large_lrf_fl4_1200_2000.py version, even with batch_size=1 and "fp16" mixed precision training. Could you please release the weights and accuracy info for the relation_detr_focalnet_large_lrf_fl4_800_1333.py version? Thank you.