wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

Is it intentional or a mistake that coco_proposals.json and coco_pseudo_4764.json are completely identical? #15

Open Bilibilee opened 8 months ago

Bilibilee commented 8 months ago

In this Drive, is it intentional or a mistake that coco_proposals.json and coco_pseudo_4764.json are completely identical?

wusize commented 8 months ago

It's intentional. We needed to make sure both methods used the same set of region proposals to fairly verify that self-distillation is better than noisy region-text pairs. Kindly note that the category ids were not used during CLIPSelf training, even though they are in the json.

kinredon commented 7 months ago

@wusize Hi, I am curious about how the region proposals and corresponding category ids were obtained when the model is only trained on the base categories. I noticed that coco_proposals.json contains a large number of category ids.

wusize commented 7 months ago

> @wusize Hi, I am curious about how the region proposals and corresponding category ids were obtained when the model is only trained on the base categories. I noticed that coco_proposals.json contains a large number of category ids.

Hi! Please refer to A.4 in the appendix of the paper. You can also have a look at the data preparation of VLDet or RegionCLIP.

[image: excerpt from Appendix A.4 of the paper]

kinredon commented 7 months ago

@wusize Thanks for your quick reply. I have read Appendix A.4 in the paper and checked the data preparation of VLDet, but there is no information about how to generate the region proposals.

I would like to leverage coco_proposals.json in my project, so I need to understand how it was generated. Can you provide some information on how to obtain coco_proposals.json, or where you downloaded it?

Many thanks again!

wusize commented 7 months ago
  1. Train an RPN on the base categories of COCO, or take the RPN part of any off-the-shelf open-vocabulary detector trained on COCO.
  2. Use the RPN to generate proposals.
  3. Extract CLIP image embeddings for these proposals.
  4. Parse each COCO caption into a group of nouns or phrases.
  5. Extract CLIP text embeddings for these nouns/phrases.
  6. Do bipartite matching between the image embeddings and text embeddings (see the sketch after this list).
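
For readers reproducing this recipe, here is a minimal sketch of steps 3-6. It is an editor's illustration, not code from this repo: it assumes xyxy pixel boxes already produced by an RPN (step 2), the OpenAI `clip` package, spaCy for noun-phrase parsing, and SciPy's Hungarian solver for the matching; all function names are hypothetical.

```python
# Sketch of steps 3-6 above (illustrative, not repo code). Assumptions:
# `boxes` are xyxy pixel coordinates from an RPN, OpenAI `clip` package,
# spaCy for noun-phrase parsing, SciPy's Hungarian solver for matching.
import clip
import spacy
import torch
from PIL import Image
from scipy.optimize import linear_sum_assignment

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def region_embeddings(image: Image.Image, boxes):
    # Step 3: crop each proposal box and embed it with the CLIP image encoder.
    crops = torch.stack([preprocess(image.crop(tuple(b))) for b in boxes])
    with torch.no_grad():
        feats = model.encode_image(crops.to(device))
    return feats / feats.norm(dim=-1, keepdim=True)

def caption_phrases(caption: str):
    # Step 4: parse a COCO caption into noun phrases.
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def text_embeddings(phrases):
    # Step 5: embed the parsed nouns/phrases with the CLIP text encoder.
    tokens = clip.tokenize([f"a photo of {p}" for p in phrases]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def match_regions_to_phrases(region_feats, text_feats):
    # Step 6: bipartite matching that maximizes total cosine similarity.
    sim = (region_feats @ text_feats.T).float().cpu().numpy()
    rows, cols = linear_sum_assignment(-sim)  # negate: the solver minimizes
    return sim, list(zip(rows.tolist(), cols.tolist()))
```
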
kinredon commented 7 months ago

I got it. Thanks!

kinredon commented 1 month ago
> 1. Train an RPN on the base categories of COCO, or take the RPN part of any off-the-shelf open-vocabulary detector trained on COCO.
> 2. Use the RPN to generate proposals.
> 3. Extract CLIP image embeddings for these proposals.
> 4. Parse each COCO caption into a group of nouns or phrases.
> 5. Extract CLIP text embeddings for these nouns/phrases.
> 6. Do bipartite matching between the image embeddings and text embeddings.

@wusize Hi, when I checked the generated proposals in coco_pseudo_4764.json, I found many differences between its category ids and those in VLDet. For example:

[image: comparison of category ids in coco_pseudo_4764.json vs. VLDet]

The number of category ids is smaller than in VLDet, so the last step (6) cannot be a simple bipartite matching. Do you apply some filtering operation? I hope you can give me some suggestions. Many thanks.
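
(The thread leaves this question open. For readers reproducing the pipeline, one plausible interpretation, purely an assumption and not anything the authors have confirmed, is a similarity threshold applied after the matching, which would shrink the set of surviving category ids. A minimal sketch, building on the hypothetical `match_regions_to_phrases` above:)

```python
# Hypothetical post-matching filter (an assumption, not confirmed in this
# thread): keep only matched region/phrase pairs whose CLIP cosine
# similarity clears a threshold, discarding low-confidence pseudo labels.
def filter_matches(matches, sim, threshold=0.25):
    return [(r, c) for r, c in matches if sim[r, c] >= threshold]
```
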