about the structure - Githubissues

wusize / ovdet

[CVPR2023] Code Release of Aligning Bag of Regions for Open-Vocabulary Object Detection

https://openaccess.thecvf.com/content/CVPR2023/papers/Wu_Aligning_Bag_of_Regions_for_Open-Vocabulary_Object_Detection_CVPR_2023_paper.pdf


about the structure #34

Closed by qiandl2000 1 year ago

qiandl2000 commented 1 year ago

Why can't the model directly align the teacher model's visual features with the student model's visual features? I don't understand why we need a linear layer to generate pseudo words and then use the encoder to get their features.
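
To make the comparison concrete, here is a minimal NumPy sketch of the two options as I understand them. It is not taken from the ovdet code: every name, dimension, and the toy encoder below is hypothetical, and a plain cosine distance stands in for whatever loss the repository actually uses.

```python
import numpy as np

# All sizes are hypothetical and chosen only for illustration.
REGION_DIM = 16  # student region feature size
WORD_DIM = 8     # word embedding size expected by the encoder
EMBED_DIM = 8    # shared embedding size of the teacher


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def direct_alignment_loss(region_feats, teacher_feats, w_proj):
    """Option A: project each student region feature and align it
    directly with the matching teacher visual feature."""
    student = l2_normalize(region_feats @ w_proj)
    teacher = l2_normalize(teacher_feats)
    return float(np.mean(1.0 - np.sum(student * teacher, axis=-1)))


def pseudo_words(region_feats, w_word, b_word):
    """Linear layer mapping each region feature to one pseudo word."""
    return region_feats @ w_word + b_word


def toy_encoder(word_embeds, w_out):
    """Stand-in for the real encoder (assumption): mean-pool the
    pseudo words into one vector, then project it."""
    return word_embeds.mean(axis=0) @ w_out


def pseudo_word_loss(region_feats, teacher_embed, w_word, b_word, w_out):
    """Option B: regions -> pseudo words -> encoder -> one embedding,
    aligned with a single teacher embedding for the whole group."""
    words = pseudo_words(region_feats, w_word, b_word)
    group = l2_normalize(toy_encoder(words, w_out))
    return float(1.0 - np.sum(group * l2_normalize(teacher_embed)))


rng = np.random.default_rng(0)
regions = rng.normal(size=(4, REGION_DIM))          # 4 region features
teacher_per_region = rng.normal(size=(4, EMBED_DIM))
teacher_group = rng.normal(size=(EMBED_DIM,))
w_proj = rng.normal(size=(REGION_DIM, EMBED_DIM))
w_word = rng.normal(size=(REGION_DIM, WORD_DIM))
b_word = np.zeros(WORD_DIM)
w_out = rng.normal(size=(WORD_DIM, EMBED_DIM))

loss_a = direct_alignment_loss(regions, teacher_per_region, w_proj)
loss_b = pseudo_word_loss(regions, teacher_group, w_word, b_word, w_out)
```

In this sketch, option A needs one teacher feature per region, while option B needs only one teacher embedding for the whole group of regions. My question is why the second route is preferred.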