@wusize Hello, I apologize for taking up your valuable weekend time. Could you please review this issue and provide some guidance? Thank you once again.
Hi! This repo naturally supports training with multiple data sources. It would look like this:
```python
# dataset settings
_base_ = 'mmdet::_base_/datasets/coco_detection.py'
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
file_client_args = dict(backend='disk')

branch_field = ['det_batch', 'kd_batch', 'caption_batch']
det_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
    dict(type='MultiBranch',
         branch_field=branch_field,
         det_batch=dict(type='PackDetInputs'))
]
kd_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
    dict(type='MultiBranch',
         branch_field=branch_field,
         kd_batch=dict(type='PackDetInputs'))
]
cap_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    # dict(type='PackDetInputs')
    dict(type='MultiBranch',
         branch_field=branch_field,
         caption_batch=dict(type='PackDetInputs',
                            meta_keys=['img_id', 'img_path', 'ori_shape',
                                       'img_shape', 'scale_factor',
                                       'flip', 'flip_direction', 'captions',
                                       'tags', 'image_ids']))
]
det_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=det_pipeline)
kd_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',  # the gt boxes of base categories might be used
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=kd_pipeline)
cap_dataset = dict(
    type='CocoCaptionOVDDataset',
    data_root=data_root,
    ann_file='wusize/captions_train2017_tags_allcaps.json',
    data_prefix=dict(img='train2017/'),
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=cap_pipeline)
batch_split = [2, 2, 2]
train_dataloader = dict(
    batch_size=sum(batch_split),
    num_workers=sum(batch_split),
    persistent_workers=True,
    sampler=dict(type='CustomGroupMultiSourceSampler',
                 batch_size=sum(batch_split),
                 source_ratio=batch_split),
    batch_sampler=None,
    dataset=dict(
        _delete_=True,
        type='ConcatDataset',
        datasets=[det_dataset, kd_dataset, cap_dataset]))
kd_cfg = dict(
    type='BaronKD',
    boxes_cache=None,
    use_gt=True,
    bag_weight=1.0, single_weight=0.1, use_attn_mask=False,
    bag_temp=30.0, single_temp=50.0,
    clip_data_preprocessor=dict(
        type='ImgDataPreprocessor',
        bgr_to_rgb=True,
        # CLIP's pixel means minus the detector's pixel means, presumably
        # because the images arriving here have already been mean-subtracted
        # by the detector's data preprocessor
        mean=[122.7709383 - 123.675,
              116.7460125 - 116.28,
              104.09373615 - 103.53],
        std=[68.5005327, 66.6321579, 70.32316305]),
    num_words=4, word_dim=512, words_drop_ratio=0.5,
    queue_cfg=dict(names=['clip_text_features', 'clip_image_features',
                          'clip_word_features', 'clip_patch_features'],
                   lengths=[1024] * 4,
                   emb_dim=512, id_length=1),
    sampling_cfg=dict(shape_ratio_thr=0.25,
                      area_ratio_thr=0.01,
                      objectness_thr=0.85,
                      nms_thr=0.1,
                      topk=300,
                      max_groups=3,
                      max_permutations=2,
                      alpha=3.0,
                      cut_off_thr=0.3,
                      base_probability=0.3,
                      interval=-0.1))
cap_cfg = dict(
    type='BaronCaption',
    loss_weight=10.0, norm_temp=30.0, max_caps=5,
    num_words=4, word_dim=512,
    words_drop_ratio=0.5,
    use_pe=True,
    queue_cfg=dict(names=['clip_cap_text_features',
                          'clip_caption_features'],
                   lengths=[1024] * 2,
                   emb_dim=512, id_length=1),
    sampling_cfg=dict(shape_ratio_thr=0.25,
                      area_ratio_thr=0.01,
                      objectness_thr=0.85,
                      nms_thr=0.2,
                      max_num=10,
                      topk=128,
                      max_perms=4))
model = dict(
    batch2ovd=dict(kd_batch='baron_kd', caption_batch='baron_caption'),
    roi_head=dict(
        type='OVDStandardRoIHead',
        ovd_cfg=dict(baron_kd=kd_cfg, baron_caption=cap_cfg),
    ),
)
```
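To make the data flow concrete: with `batch_split = [2, 2, 2]`, each training batch of 6 contains 2 detection images, 2 KD images, and 2 caption images, and `MultiBranch` tags every packed sample with its branch key, so that `batch2ovd` can route the `kd_batch` samples to `BaronKD` and the `caption_batch` samples to `BaronCaption`. Below is a minimal, self-contained sketch of that batching idea, not the repo's actual sampler implementation:

```python
import itertools
import random

# Toy stand-ins for the three data sources; items are just indices here.
det_data = list(range(0, 100))    # detection images with base-category boxes
kd_data = list(range(100, 200))   # images used for CLIP knowledge distillation
cap_data = list(range(200, 300))  # images paired with captions and tags

def multi_source_batches(sources, source_ratio, branch_keys):
    """Yield batches that mix the sources at a fixed per-batch ratio,
    mimicking what the multi-source sampler does."""
    iterators = [itertools.cycle(random.sample(src, len(src))) for src in sources]
    while True:
        batch = []
        for it, n, key in zip(iterators, source_ratio, branch_keys):
            # MultiBranch tags each packed sample with its branch key.
            batch.extend((key, next(it)) for _ in range(n))
        yield batch

batches = multi_source_batches(
    [det_data, kd_data, cap_data],
    source_ratio=[2, 2, 2],
    branch_keys=['det_batch', 'kd_batch', 'caption_batch'])
print(next(batches))
# e.g. [('det_batch', 41), ('det_batch', 7), ('kd_batch', 153), ('kd_batch', 110),
#       ('caption_batch', 265), ('caption_batch', 202)]
```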
You can obtain region proposals from this link. They provide both class-specific and class-agnostic proposals. The class-specific proposals can be added directly to the detection training branch, and the class-agnostic proposals can be added to the caption and KD branches; a sketch of how that might look follows below. However, using external region proposals leads to a significantly unfair comparison with other methods, and I do not recommend doing so (unless reviewers require a comparison with link).
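If you do want to experiment with external proposals despite that caveat, mmdet already has plumbing for it. A hedged sketch, following the style of mmdet's Fast R-CNN configs: the proposal file name below is a placeholder, and you would still need to verify that the OVD losses in this repo actually consume the loaded proposals.

```python
# Sketch: feed external class-agnostic proposals to the KD branch.
kd_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadProposals', num_max_proposals=300),  # mmdet built-in transform
    dict(type='LoadAnnotations', with_bbox=True),
    # ProposalBroadcaster applies geometric transforms to the proposals as well.
    dict(type='ProposalBroadcaster',
         transforms=[
             dict(type='Resize', scale=(1333, 800), keep_ratio=True),
             dict(type='RandomFlip', prob=0.5),
         ]),
    dict(type='MultiBranch',
         branch_field=branch_field,
         kd_batch=dict(type='PackDetInputs'))
]
kd_dataset = dict(
    type='CocoDataset',
    data_root=data_root,
    ann_file='annotations/instances_train2017_base.json',
    data_prefix=dict(img='train2017/'),
    proposal_file='proposals/class_agnostic_train2017.pkl',  # placeholder path
    filter_cfg=dict(filter_empty_gt=False),
    pipeline=kd_pipeline)
```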
Unfortunately, I don't have much time before the CVPR deadline, so I cannot provide more help on the implementation.
Thank you for your reply. I think it will be of great help to me.
Hello, while attempting to replicate the results, I noticed that Table 1 of the paper includes experiments that combine CLIP and captions as supervision, but the repository only contains code for the individual caption and knowledge distillation (KD) experiments. How should I modify the code to reproduce the combined experiments?