stonewst opened this issue 2 years ago
Hi @wusitong98,
- The "clip_feat.pkl" refers to the offline pre-computed CLIP image features. I will release it this week. Please stay tuned.
- Yes, will update this week.
Thanks for your reply!
I still have some questions about the generation process of clip_feat.pkl as follows:
The code `index = torch.randperm(len(self.clip_feat[cat_id]))[0:1]` in file "ovdetr/models/model.py" (line 297) suggests that each category has multiple CLIP image features, one of which is randomly selected as the conditional image input during training.
Is "the number of CLIP image features of category i" equal to "the number of bboxes of category i among the training data"? To be specific, let's assume there are M ground-truth (gt) bboxes belonging to "cat" across all the training data. The regions corresponding to these M gt bboxes are cropped from the images and then passed through the pretrained CLIP image encoder to generate M feature vectors; thus, in clip_feat.pkl, the category "cat" has M CLIP features. Am I right? Could you provide more details if my understanding is wrong? And could you share your clip_feat.pkl generation script? Sincere thanks!
Also, why not generate the CLIP image features online?
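To make my understanding concrete, here is a minimal sketch of the generation script I have in mind (the annotation/image paths, the ViT-B/32 backbone, and the {category_id: (M, 512) tensor} layout are my assumptions, not taken from your repo):

```python
# Sketch: pre-compute per-category CLIP image features from GT boxes.
# Assumptions (not from the repo): COCO-style annotations, ViT-B/32 backbone,
# and a {category_id: (M_i, 512) tensor} layout for clip_feat.pkl.
import pickle
import torch
import clip
from PIL import Image
from pycocotools.coco import COCO

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

coco = COCO("annotations/instances_train2017.json")  # hypothetical path
clip_feat = {}

with torch.no_grad():
    for img_id in coco.getImgIds():
        info = coco.loadImgs(img_id)[0]
        image = Image.open(f"train2017/{info['file_name']}").convert("RGB")
        for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
            x, y, w, h = ann["bbox"]
            if w < 1 or h < 1:
                continue
            crop = image.crop((int(x), int(y), int(x + w), int(y + h)))  # crop the GT box
            feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            feat = feat / feat.norm(dim=-1, keepdim=True)                # L2-normalize
            clip_feat.setdefault(ann["category_id"], []).append(feat.cpu())

# one (M_i, 512) tensor per category
clip_feat = {k: torch.cat(v, dim=0) for k, v in clip_feat.items()}
with open("clip_feat.pkl", "wb") as f:
    pickle.dump(clip_feat, f)
```

At least this layout would be consistent with indexing `self.clip_feat[cat_id]` via `torch.randperm(len(self.clip_feat[cat_id]))[0:1]` in model.py.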
Hi @wusitong98
Thanks for your positive and quick reply! It helps me a lot.
I really appreciate your idea, using conditional binary matching to enable the DETR architecture for the open-vocabulary object detection task. After reading carefully, I still have some questions, as follows:
1. About the training settings: I could not find the training hyper-parameters (such as batch size, learning rate, max_length, etc.) in the paper.
2. About the training data: As mentioned in the paper, novel classes with pseudo bboxes (generated object proposals) are also involved in the training data. The best performance, 17.4 $AP^m_{novel}$ on LVIS and 29.4 $AP50^b_{novel}$ on COCO, is achieved with both base and novel classes as training data. However, the current code seems to only load `instances_train2017_seen_2.json` (in file "ovdetr/datasets/coco.py" line 272).
3. About `self.seen_ids`:
In the current code, `self.seen_ids` is 0-64 for COCO and 0-1202 for LVIS. That is to say, the CLIP text/image embeddings of novel classes may appear during training, regardless of whether the novel classes are involved in the training data. If so, something seems wrong here. Or maybe I have misunderstood?
4. About the classifier to predict matchability:
Thanks to the conditional binary matching strategy, OV-DETR no longer relies on the CLIP text embeddings as the classifier. Instead, a simple fully connected layer with output channel = 1 (file "ovdetr/models/model.py" line 49) is used as the classifier to predict the matchability between each detection result and the corresponding conditional input. Is my understanding correct?
Rows #1 and #2 in Table 2 use the CLIP text embeddings as the classifier, but row #3 uses the fully connected layer rather than the CLIP text embeddings. Am I right?
5. About R:
Thanks for your explanation in the Issue, I understand the meaning of R, but my question lies in the code implementation.
5.1: It seems that `self.max_len` (file "ovdetr/models/model.py" line 246) corresponds to R in the paper, but `self.max_len` is 15. Why is it not 3, as mentioned in the paper?
5.2: What is the purpose of `self.max_pad_len` ("ovdetr/models/model.py" line 247), and why is it set to `self.max_len - 3`?
Hi @wusitong98,
1 & 2.2 & 5.1: The current code is under-prepared. I will provide the config files and the JSON files with extra proposals for the COCO/LVIS datasets.
2.1: I did not try it. I guess the results would be similar to the first row of Table 2.
3: These lines are used to sample conditional queries for classes that do not exist in the image, which helps filter out false-positive cases.
4: Yes. For our method, the classification predictions of the N object queries have shape (N, 1), which gives the "matched" or "not matched" probability for the conditional inputs of the Transformer decoder.
5.2: To save GPU memory. You can delete this line if you do not face the out-of-memory issue.
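To make point 4 concrete for other readers, a minimal sketch of such an (N, 1) matchability head (the hidden_dim, class name, and loss remark are illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

class MatchabilityHead(nn.Module):
    """Sketch of a binary 'matched / not matched' classifier for N object queries.

    Illustrative only: a single fully connected layer with output channel = 1,
    applied to decoder outputs that were conditioned on one CLIP embedding.
    """
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.class_embed = nn.Linear(hidden_dim, 1)  # (N, hidden_dim) -> (N, 1)

    def forward(self, decoder_output: torch.Tensor) -> torch.Tensor:
        # decoder_output: (N, hidden_dim) for the object queries tied to one
        # conditional input; the sigmoid of the logit is the match probability.
        return self.class_embed(decoder_output)

# Usage sketch: logits of shape (N, 1), trained with a binary (e.g. focal) loss
# against whether each matched GT box belongs to the conditioning class.
head = MatchabilityHead(hidden_dim=256)
logits = head(torch.randn(300, 256))   # (300, 1)
probs = logits.sigmoid()
```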
Okay, I understand. Looking forward to your updates. Thanks!
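For points 3 and 5.2 above, a rough sketch of how the per-image conditional queries might be sampled and padded (every name and the exact max_len/max_pad_len semantics here are my assumptions about what the lines around "ovdetr/models/model.py" lines 246-247 do, not a verified copy):

```python
import torch

def sample_conditional_categories(gt_cat_ids, seen_ids, clip_feat, text_feat,
                                  max_len=15, max_pad_len=12):
    """Sketch: pick the categories that condition the decoder for one image.

    Assumed behavior (not verified against the repo): keep the classes present
    in the image, then pad with randomly sampled absent classes (negatives that
    help suppress false positives), capping the total at max_len for memory.
    """
    pos = list(dict.fromkeys(gt_cat_ids))[:max_len]           # unique present classes
    neg_pool = [c for c in seen_ids if c not in pos]
    num_neg = min(max_pad_len, max_len - len(pos), len(neg_pool))
    neg = [neg_pool[int(i)] for i in torch.randperm(len(neg_pool))[:num_neg]]

    queries = []
    for cat_id in pos + neg:
        # one randomly chosen pre-computed CLIP image feature per class,
        # mirroring index = torch.randperm(len(self.clip_feat[cat_id]))[0:1]
        idx = torch.randperm(len(clip_feat[cat_id]))[0:1]
        img_emb = clip_feat[cat_id][idx]                       # (1, 512)
        queries.append((cat_id, img_emb, text_feat[cat_id]))
    return queries  # class id, CLIP image feature, CLIP text feature
```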
Hi @wusitong98, do you know why the proposals are claimed to contain novel classes in the training stage? There are no novel classes appearing in the training images, even though the embeddings for the novel classes appear during training.
FYI @yuhangzang, @wusitong98, I have trained OV-DETR without the CLIP image embeddings by setting --prob 1.0. The results are:
bbox AP seen: 56.81546530654572
bbox AP unseen: 26.61650706917391
This means the CLIP text embeddings are very useful in the current framework.
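For context on the --prob 1.0 setting above, my reading of it, sketched under the assumption that prob is the probability of conditioning on the CLIP text embedding rather than a pre-computed image embedding during training:

```python
import torch

def pick_conditional_embedding(text_emb, image_embs, prob=0.5):
    """Sketch: choose the conditional input for one class during training.

    Assumption (not verified line-by-line): with probability `prob` the CLIP
    text embedding is used, otherwise one pre-computed CLIP image feature is
    drawn at random; --prob 1.0 therefore trains with text embeddings only.
    """
    if torch.rand(1).item() < prob:
        return text_emb                                   # text-conditioned query
    idx = torch.randperm(len(image_embs))[0:1]
    return image_embs[idx].squeeze(0)                     # image-conditioned query
```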
@d12306 Hello, what are the hyperparameters used in your re-implementation experiments, such as total_epoch, number of GPUs, and so on? Looking forward to your reply, thanks!
1. When I train on COCO with the default configs in main.py, there is an out-of-memory issue (Tesla V100S-PCIE-32GB).
2. When I set batch_size=1, the results differ from the paper, and I'm not sure whether it is the batch size or some other configuration:
bbox AP seen: 58.923614572881796 (paper: 61), bbox AP unseen: 27.97123823908958 (paper: 29.4).
@yuhangzang hope for your help.
The same out-of-memory issue here, with a Tesla V100-SXM2-32GB.
Hi! Thanks for your interesting work on open vocabulary detection. I read the paper and tried to run the code, but had some trouble. Hope for your help!