Hi, thanks for your question! Directly matching the pseudo words with real word embeddings yields poor performance. It would likely require extra losses to align the pseudo words more strictly with the real word embeddings, and the number of words in category names varies, which makes the matching complicated. We modified the forward function of the Text Encoder so that the context length is not 77 but the number of pseudo words, which keeps the inference time acceptable.
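For concreteness, here is a minimal sketch of that idea, not the repository's actual code: the module, layer sizes, and names below are illustrative stand-ins for a CLIP-style text transformer. The point is that the forward pass consumes pseudo-word embeddings directly and only runs over L positions (the number of pseudo words) instead of padding to a fixed context length of 77.

```python
import torch
import torch.nn as nn


class PseudoWordTextEncoder(nn.Module):
    """Toy stand-in for a CLIP-style text transformer (hypothetical names/sizes)."""

    def __init__(self, width=512, layers=4, heads=8, max_len=77):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.zeros(max_len, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.ln_final = nn.LayerNorm(width)
        self.text_projection = nn.Parameter(torch.randn(width, width) / width ** 0.5)

    def forward(self, pseudo_words):
        # pseudo_words: (batch, L, width), already in the token-embedding (input) space.
        L = pseudo_words.shape[1]
        x = pseudo_words + self.positional_embedding[:L]            # slice positions, don't pad to 77
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(x.device)
        x = self.transformer(x, mask=causal)                        # attention runs over L tokens only
        x = self.ln_final(x)
        return x[:, -1] @ self.text_projection                      # last position as the text feature


# Usage: 6 pseudo words per region instead of a 77-token padded sequence.
feats = PseudoWordTextEncoder()(torch.randn(2, 6, 512))             # -> (2, 512)
```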
Thanks for your reply! Hence, BARON can be seen as an ensemble of a learnable object detector and a pre-trained ViT encoder. I am confused about the mentioned poor results of directly matching the pseudo words (visual embeddings) with text embeddings, since this vision-to-text alignment seems similar to other OVD baselines, e.g., VLDet.
That's because the pseudo words are aligned to the input space of the text encoder. In other OVD baselines such as VLDet, the visual embeddings are aligned to the output space of the text encoder.
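To make the distinction concrete, here is a small sketch with made-up shapes and a placeholder text encoder; it is not the actual BARON or VLDet code, only an illustration of where the comparison happens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes and modules for illustration only.
text_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(6 * 512, 512))  # stand-in for a text encoder
region_feats = torch.randn(8, 512)             # detector region embeddings
pseudo_words = torch.randn(8, 6, 512)          # pseudo words predicted from region features
category_embeddings = torch.randn(65, 512)     # cached category text embeddings (output space)

# (a) Output-space alignment (VLDet-style): compare visual embeddings directly
#     against the cached text-encoder outputs.
logits_out = F.normalize(region_feats, dim=-1) @ F.normalize(category_embeddings, dim=-1).t()

# (b) Input-space alignment (BARON): pseudo words live in the text encoder's input
#     space, so they are passed through the text encoder before the comparison.
encoded = F.normalize(text_encoder(pseudo_words), dim=-1)            # (8, 512), now in the output space
logits_in = encoded @ F.normalize(category_embeddings, dim=-1).t()
```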
Thank you for your clarification!
This is really excellent work, and I would like to follow it. Since SOCO is tailor-designed for the vision-language task and is not used in most other OVD works, do you have results using an ImageNet pre-trained backbone on the COCO benchmark (42.3 AP_r)?
I followed DetPro in using the SOCO models for fast convergence. The 42.7 AP_novel on COCO is achieved by using an external model for proposal generation, to fairly compare with object-centric-ovd.
To run a simple version of our work with an ImageNet pre-trained backbone, you can use this config file. You can also replace it with other two-stage detectors supported in MMDet.
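If it helps, the switch to an ImageNet-pretrained backbone typically amounts to an MMDetection-style config override like the sketch below. The base config path is a placeholder, not a file from this repo, so adapt it to the config linked above.

```python
# Hypothetical MMDetection config override; '_base_' points to a placeholder path.
_base_ = './baron_faster_rcnn_r50_fpn_coco.py'

model = dict(
    backbone=dict(
        # Replace the SOCO checkpoint with a standard ImageNet-pretrained ResNet-50.
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')))
```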
And thank you for your interest in following this work! I will be active for any questions and discussion.
Thanks for your great work!
I'm just wondering about the inference stage. The proposed method still requires the whole Text Encoder at inference time for each image, which differs from existing OVD works that only use pre-computed text embeddings. This leads to a significantly larger model and unsatisfactory inference speed, even doubling the model parameters. Is it possible to remove the text encoder at the inference stage?