wusize / ovdet

[CVPR2023] Code Release of Aligning Bag of Regions for Open-Vocabulary Object Detection
https://openaccess.thecvf.com/content/CVPR2023/papers/Wu_Aligning_Bag_of_Regions_for_Open-Vocabulary_Object_Detection_CVPR_2023_paper.pdf

about the structure #34

qiandl2000 closed this issue 11 months ago

qiandl2000 commented 11 months ago

Why can't the model directly align the visual features of the teacher model with the visual features of the student model? I don't understand why we need a linear layer to generate pseudo words and then pass them through the encoder to get their features.
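The pipeline the question describes can be sketched in plain Python with toy numbers. Every piece here is an assumption for illustration and is not taken from the ovdet code. The names `linear_to_pseudo_words`, `toy_text_encoder`, and `info_nce` are hypothetical. The mean-pooling "encoder" is only a stand-in for the real encoder. The contrastive (InfoNCE-style) loss is an assumed choice of alignment objective. The flow is: student region features go through a linear layer to become pseudo words, the encoder turns each bag of pseudo words into one embedding, and that embedding is compared with a teacher embedding of the same bag.

```python
import math


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def linear_to_pseudo_words(region_feats, W):
    # Hypothetical linear layer: maps each region feature to one "pseudo word"
    # vector in the encoder's input embedding space.
    return [matvec(W, f) for f in region_feats]


def toy_text_encoder(pseudo_words):
    # Stand-in for the encoder: mean-pools the pseudo-word sequence into one
    # bag embedding. A real encoder would also model interactions and order.
    dim = len(pseudo_words[0])
    return [sum(tok[d] for tok in pseudo_words) / len(pseudo_words) for d in range(dim)]


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def info_nce(student_embs, teacher_embs, tau=0.07):
    # Assumed contrastive loss: bag i's student embedding should be most
    # similar to teacher embedding i among all teacher embeddings in the batch.
    loss = 0.0
    for i, s in enumerate(student_embs):
        logits = [cosine(s, t) / tau for t in teacher_embs]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(student_embs)


# Toy example: two bags, each with two 3-d region features.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # projects 3-d features to 2-d pseudo words
bags = [
    [[1.0, 0.0, 5.0], [1.0, 0.0, -5.0]],
    [[0.0, 1.0, 2.0], [0.0, 1.0, 0.0]],
]
student = [toy_text_encoder(linear_to_pseudo_words(bag, W)) for bag in bags]

teacher_matched = [[1.0, 0.0], [0.0, 1.0]]  # teacher embeddings in the same order
teacher_swapped = [[0.0, 1.0], [1.0, 0.0]]  # teacher embeddings in the wrong order

loss_matched = info_nce(student, teacher_matched)
loss_swapped = info_nce(student, teacher_swapped)
print(loss_matched, loss_swapped)
```

In this sketch the loss is near zero when each bag's student embedding lines up with its own teacher embedding and large when the pairing is swapped. The mean-pooling stand-in ignores token order and interactions between regions. A real encoder would not, which may be one motivation for routing pseudo words through an encoder rather than matching features directly, though that reading is an assumption here.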