yuhangzang / OV-DETR

[Under preparation] Code repo for "Open-Vocabulary DETR with Conditional Matching" (ECCV 2022)

About “clip_feat.pkl” #3

Open stonewst opened 2 years ago

stonewst commented 2 years ago

Hi! Thanks for your interesting work on open vocabulary detection. I read the paper and tried to run the code, but had some trouble. Hope for your help!

  1. How can I get the file "clip_feat.pkl"? What does it represent?
  2. Where is "self.all_ids" in "ovdetr/models/model.py" line 290 defined? Does it mean "self.seen_ids" in line 245?
yuhangzang commented 2 years ago

Hi @wusitong98,

  1. The "clip_feat.pkl" refers to the offline pre-computed CLIP image features. I will release it this week. Please stay tuned.
  2. Yes, will update this week.
stonewst commented 2 years ago

Thanks for your reply!

I still have some questions about the generation process of clip_feat.pkl as follows:

The line `index = torch.randperm(len(self.clip_feat[cat_id]))[0:1]` in "ovdetr/models/model.py" line 297 suggests that each category has multiple CLIP image features, one of which is selected at random as the conditional image input during training.

  1. Is "the number of CLIP image feature of category i" equals to "the number of bbox of category i among training data"? To be specific, let's assume that there are M ground-truth(gt) bbox belonging to "cat" among all the training data. The regions corresponding to these M gt bbox are cropped from the image, and then passed through the pretrained CLIP image model to generate M feature vector. Thus, in clip_feat.pkl, the category "cat" has M CLIP features. Am I right? Could you provide more details if my understanding is wrong? And could you share your clip_feat.pkl generation script? Sincere thanks!

  2. Why not generate the CLIP image features online?
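
To make my assumption concrete, here is a rough sketch of how I imagine clip_feat.pkl could be generated (this is only my guess, not your script; it assumes the openai/CLIP package, COCO-style annotations, and placeholder paths):

```python
# Hypothetical generation script for clip_feat.pkl (my guess, not the official one):
# crop every ground-truth box, encode it with the CLIP image encoder, and group
# the resulting features by category id.
import pickle
from collections import defaultdict

import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image
from pycocotools.coco import COCO

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

coco = COCO("annotations/instances_train2017.json")  # placeholder path
clip_feat = defaultdict(list)

with torch.no_grad():
    for ann in coco.loadAnns(coco.getAnnIds()):
        x, y, w, h = ann["bbox"]
        if w < 1 or h < 1:
            continue
        img_info = coco.loadImgs(ann["image_id"])[0]
        image = Image.open(f"train2017/{img_info['file_name']}").convert("RGB")
        crop = image.crop((x, y, x + w, y + h))
        feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize, as CLIP features usually are
        clip_feat[ann["category_id"]].append(feat.squeeze(0).cpu())

with open("clip_feat.pkl", "wb") as f:
    pickle.dump({k: torch.stack(v) for k, v in clip_feat.items()}, f)
```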

yuhangzang commented 2 years ago

Hi @wusitong98

  1. You are right. The generation script will be released this week.
  2. To reduce the training time.
stonewst commented 2 years ago

Thanks for your positive and quick reply! It helps me a lot.

I really appreciate your idea of using conditional binary matching to adapt the DETR architecture to the open-vocabulary object detection task. After reading the paper carefully, I still have some questions, as follows:

1. About the training settings: I could not find the training hyper-parameters (such as batch size, learning rate, max_length, etc.) in the paper. Could you share them?

2. About the training data: As mentioned in the paper, novel classes with pseudo bboxes (generated object proposals) are also included in the training data. The best performance, 17.4 $AP^m_{novel}$ on LVIS and 29.4 $AP50^b_{novel}$ on COCO, is achieved with both base and novel classes as training data. (2.1) What would the performance be if the pseudo bboxes of the novel classes were not used during training? (2.2) Could you release these pseudo-bbox annotations?

3. About "self.seen_ids": In current code, self.seen_ids is 0-64 for COCO and 0-1202 for LVIS. That is to say, the CLIP text/image embeddings of novel classes may appear during training, regardless of whether the novel classes are involved in training data. If so, something seems wrong here. Or maybe I understand it wrong?

4. About the classifier that predicts matchability: Thanks to the conditional binary matching strategy, OV-DETR no longer relies on CLIP text embeddings as the classifier. Instead, a simple fully connected layer with output channel = 1 ("ovdetr/models/model.py" line 49) is used to predict the matchability between each detection result and the corresponding conditional input. Is my understanding right? Rows #1 and #2 in Table 2 use CLIP text embeddings as the classifier, but row #3 uses the fully connected layer rather than CLIP text embeddings. Am I right?

5. About “R”: Thanks for your explanation in the other issue; I understand the meaning of R, but my question is about the code implementation. It seems that self.max_len ("ovdetr/models/model.py" line 246) corresponds to R in the paper. (5.1) Is that correct, and which value is used in your experiments? (5.2) Why is this limit needed?

yuhangzang commented 2 years ago

Hi @wusitong98,

1 & 2.2 & 5.1: The current code is under-prepared. I will provide the config files and the JSON files with extra proposals for COCO/LVIS datasets. 2.1: I do not try. I guess the results will be similar to the first row of Table 2. 3: These lines are used to sample the conditional queries that are not existing in the image, which helps filter out false-positive cases.

  1. Yes. The shape of classification predictions of N object queries are (N, 1) for our method, which is related to the 'matched’ or `not matched’ probability of conditional inputs for Transformer Decoders. 5.2 To save the GPU memory constraints. You can delete this line if you do not face the out-of-memory issue.
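
For point 4, a minimal sketch of such a binary matchability head (illustrative names and sizes, not the exact lines in "ovdetr/models/model.py"):

```python
# Illustrative binary "matched / not matched" head on top of the decoder outputs.
import torch
from torch import nn

hidden_dim, num_queries = 256, 300            # assumed sizes, for illustration only
class_embed = nn.Linear(hidden_dim, 1)        # output channel = 1

decoder_output = torch.randn(num_queries, hidden_dim)  # embeddings of N object queries
match_logits = class_embed(decoder_output)             # shape (N, 1)
match_prob = match_logits.sigmoid()                    # probability of matching the conditional input
```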
stonewst commented 2 years ago

Okay, I understand. Looking forward to your updates. Thanks!

d12306 commented 2 years ago

Hi @wusitong98, do you know why the proposals are claimed to contain novel classes at the training stage? No novel classes appear in the training images, even if the embeddings of the novel classes appear during training.

d12306 commented 2 years ago

FYI @yuhangzang, @wusitong98: I have trained OV-DETR without the CLIP image embeddings by setting --prob 1.0. The results are:

bbox AP seen: 56.81546530654572
bbox AP unseen: 26.61650706917391

This suggests that the CLIP text embeddings are very useful in the current framework.
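
For context, my understanding of --prob (a sketch of the sampling logic as I read it, not the exact repo code): each conditional input uses the CLIP text embedding with probability prob and a cached CLIP image feature otherwise, so --prob 1.0 never uses the image features.

```python
# Sketch of the sampling I assume --prob controls (illustrative only).
import torch

def sample_conditional_embed(text_embed, image_feats, prob=0.5):
    # Use the CLIP text embedding with probability `prob`; otherwise pick one of
    # the pre-computed CLIP image features at random.  prob=1.0 => text only.
    if torch.rand(1).item() < prob or len(image_feats) == 0:
        return text_embed
    idx = int(torch.randperm(len(image_feats))[0])
    return image_feats[idx]
```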

HITerStudy commented 2 years ago

@d12306 Hello, which hyper-parameters did you use in your re-implementation, such as total_epoch, the number of GPUs, and so on? Looking forward to your reply, thanks!

childlong commented 2 years ago

1. When I train on COCO with the default configs in main.py, I run into an out-of-memory issue (Tesla V100S-PCIE-32GB).
2. When I set batch_size=1, the results differ from the paper, and I am not sure whether it is the batch size or some other configuration: bbox AP seen: 58.923614572881796 (paper: 61), bbox AP unseen: 27.97123823908958 (paper: 29.4).

@yuhangzang hope for your help.

eternaldolphin commented 2 years ago

The same out-of-memory issue occurs with a Tesla V100-SXM2-32GB.