wusize / ovdet

[CVPR2023] Code Release of Aligning Bag of Regions for Open-Vocabulary Object Detection
https://openaccess.thecvf.com/content/CVPR2023/papers/Wu_Aligning_Bag_of_Regions_for_Open-Vocabulary_Object_Detection_CVPR_2023_paper.pdf

About using CLIP and Captions in combination as supervision. #31

Closed: xiaoyi728 closed this issue 11 months ago

xiaoyi728 commented 11 months ago

Hello, while attempting to replicate the results, I noticed that Table 1 of the paper includes experiments that combine CLIP and captions as supervision, but the repository only contains code for the caption-only and knowledge-distillation (KD) experiments. How can I modify the code to obtain results for this combined setting?

xiaoyi728 commented 11 months ago

@wusize Hello, I apologize for taking up your weekend time. I would appreciate it if you could review this issue and provide some guidance. Thank you once again.

wusize commented 11 months ago

Hi! This repo naturally supports training with multiple sources of supervision. The config would look like this:

# dataset settings
_base_ = 'mmdet::_base_/datasets/coco_detection.py'
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
file_client_args = dict(backend='disk')
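# the three supervision branches handled by MultiBranch: box supervision on base
# categories (det_batch), CLIP knowledge distillation (kd_batch) and
# image-caption supervision (caption_batch)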
branch_field = ['det_batch', 'kd_batch', 'caption_batch']
det_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
    dict(type='MultiBranch',
         branch_field=branch_field,
         det_batch=dict(type='PackDetInputs'))
]

kd_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
    dict(type='MultiBranch',
         branch_field=branch_field,
         kd_batch=dict(type='PackDetInputs')
         )
]
cap_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
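    # the caption branch additionally packs the 'captions', 'tags' and 'image_ids'
    # meta keys provided by CocoCaptionOVDDataset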
    dict(type='MultiBranch',
         branch_field=branch_field,
         caption_batch=dict(type='PackDetInputs',
                            meta_keys=['img_id', 'img_path', 'ori_shape',
                                       'img_shape', 'scale_factor',
                                       'flip', 'flip_direction', 'captions',
                                       'tags', 'image_ids']
                            )
         )
]
det_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=det_pipeline)

kd_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',   # the gt boxes of base categories might be used
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=kd_pipeline
)
cap_dataset = dict(
    type='CocoCaptionOVDDataset',
    data_root=data_root,
    ann_file='wusize/captions_train2017_tags_allcaps.json',
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=cap_pipeline
)
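# number of images drawn from each source per batch, in dataset order [det, kd, caption]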
batch_split = [2, 2, 2]
train_dataloader = dict(
    batch_size=sum(batch_split),
    num_workers=sum(batch_split),
    persistent_workers=True,
    sampler=dict(type='CustomGroupMultiSourceSampler',
                 batch_size=sum(batch_split),
                 source_ratio=batch_split),
    batch_sampler=None,
    dataset=dict(
        _delete_=True,
        type='ConcatDataset',
        datasets=[det_dataset, kd_dataset, cap_dataset])
)

kd_cfg = dict(type='BaronKD',
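               # KD branch: aligns the detector's bag-of-regions embeddings with
               # CLIP image embeddings of the corresponding image regions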
               boxes_cache=None,
               use_gt=True,
               bag_weight=1.0, single_weight=0.1, use_attn_mask=False,
               bag_temp=30.0, single_temp=50.0,
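               # CLIP's pixel mean/std (scaled by 255); the mean is offset by the
               # detector preprocessor's ImageNet mean, presumably because that mean
               # has already been subtracted before the regions are passed to CLIP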
               clip_data_preprocessor=dict(
                   type='ImgDataPreprocessor',
                   bgr_to_rgb=True,
                   mean=[122.7709383 - 123.675,
                         116.7460125 - 116.28,
                         104.09373615 - 103.53],
                   std=[68.5005327, 66.6321579, 70.32316305]),
               num_words=4, word_dim=512, words_drop_ratio=0.5,
               queue_cfg=dict(names=['clip_text_features', 'clip_image_features',
                                     'clip_word_features', 'clip_patch_features'],
                              lengths=[1024] * 4,
                              emb_dim=512, id_length=1),
               sampling_cfg=dict(shape_ratio_thr=0.25,
                                 area_ratio_thr=0.01,
                                 objectness_thr=0.85,
                                 nms_thr=0.1,
                                 topk=300,
                                 max_groups=3,
                                 max_permutations=2,
                                 alpha=3.0,
                                 cut_off_thr=0.3,
                                 base_probability=0.3,
                                 interval=-0.1,
                                 ),
               )

cap_cfg = dict(type='BaronCaption',
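               # caption branch: aligns bag-of-regions embeddings with the image captions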
               loss_weight=10.0, norm_temp=30.0, max_caps=5,
               num_words=4, word_dim=512,
               words_drop_ratio=0.5,
               use_pe=True, queue_cfg=dict(names=['clip_cap_text_features',
                                                  'clip_caption_features'],
                                           lengths=[1024] * 2,
                                           emb_dim=512, id_length=1),
               sampling_cfg=dict(shape_ratio_thr=0.25,
                                 area_ratio_thr=0.01,
                                 objectness_thr=0.85,
                                 nms_thr=0.2,
                                 max_num=10,
                                 topk=128,
                                 max_perms=4
                                 )
               )

model = dict(
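    # route each extra batch type to the corresponding OVD module registered in ovd_cfg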
    batch2ovd=dict(kd_batch='baron_kd', caption_batch='baron_caption'),
    roi_head=dict(
        type='OVDStandardRoIHead',
        ovd_cfg=dict(baron_kd=kd_cfg, baron_caption=cap_cfg),
        ),
)

You can obtain region proposals from this link. They provide both class-specific and class-agnostic proposals. The class-specific proposals can be added directly to the detection training branch, and the class-agnostic proposals can be added to the caption and KD branches. However, using external region proposals leads to a significantly unfair comparison with other methods, and I do not recommend doing so (unless reviewers require a comparison with link).
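
In case it helps, here is a minimal sketch of how the class-agnostic proposals could be attached to the KD branch, assuming they are converted to the pickle format accepted by MMDetection 3.x's proposal_file / LoadProposals mechanism. The proposal file path below is hypothetical and this is not part of the official BARON setup; wiring the packed proposals into BaronKD's region sampling would still require extra code.

kd_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    # attach the external class-agnostic boxes to each sample
    dict(type='LoadProposals', num_max_proposals=300),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='MultiBranch',
         branch_field=branch_field,
         kd_batch=dict(type='PackDetInputs'))
]
kd_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',
    data_prefix=dict(img='train2017/'),
    proposal_file='proposals/class_agnostic_train2017.pkl',  # hypothetical path
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=kd_pipeline)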

Unfortunately, I don't have much time before the CVPR deadline, so I cannot provide more help on the implementation.

xiaoyi728 commented 11 months ago

Thank you for your reply. I think it will be of great help to me.

xiaoyi728 commented 11 months ago

Thank you for your reply.