Closed qiandl2000 closed 1 year ago
Why can't the model directly align the visual features of the teacher model with the visual features of the student model? I don't understand why we need to use a linear layer to generate pseudo words and then use encoder to get it's feature.
Why can't the model directly align the visual features of the teacher model with the visual features of the student model? I don't understand why we need to use a linear layer to generate pseudo words and then use encoder to get it's feature.