I'm also confused by this:
I expected CLIP_Surgery to have more parameters, since you added that extra path for v-v attention.
For fine-tuning: I want to emphasize that explainability generally does not require fine-tuning. If you do want to fine-tune, it's better to freeze the backbone and add new heads for out-of-domain datasets, so the model will not collapse. Besides, fine-tuning-based open-vocabulary methods may suit your needs, since some of them add local token grounding.
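To make that suggestion concrete, here is a minimal sketch (not from the repo) of freezing a pretrained backbone and training only a small new head; `backbone`, `feat_dim`, and `n_classes` are placeholders for your own setup:

```python
import torch
import torch.nn as nn

class FrozenBackboneSegHead(nn.Module):
    """Sketch of the suggestion above: keep the pretrained backbone frozen
    and train only a small new head on the out-of-domain dataset."""
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # backbone stays fixed, so it cannot collapse
        self.head = nn.Conv2d(feat_dim, n_classes, kernel_size=1)  # the only trainable part

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)    # assumed (B, feat_dim, H, W) dense features
        return self.head(feats)              # per-pixel class logits
```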
For parameters: we don't introduce any new parameters; v-v attention reuses the original parameters. We just add a new inference path that uses the original shared weights.
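For reference, a minimal sketch of what such a v-v attention path can look like, reusing an existing `nn.MultiheadAttention` layer's weights so that no new parameters are introduced (simplified to a single head with batch-first tokens; the helper is illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def vv_attention(x, attn):
    """Illustrative v-v attention path that reuses the weights of an existing
    torch.nn.MultiheadAttention module `attn` (no new parameters).
    x: (batch, seq, dim) tokens."""
    dim = attn.embed_dim
    # Take only the value projection from the shared in_proj weights.
    w_v = attn.in_proj_weight[2 * dim:]           # (dim, dim)
    b_v = attn.in_proj_bias[2 * dim:]             # (dim,)
    v = F.linear(x, w_v, b_v)                     # value tokens from the original params

    scale = dim ** -0.5                           # single-head simplification
    # v-v attention: values attend to themselves instead of q-k attention.
    attn_map = torch.softmax(v @ v.transpose(-2, -1) * scale, dim=-1)
    out = attn_map @ v
    # Reuse the original output projection as well.
    return attn.out_proj(out)
```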
@Eli-YiLi Thank you for your response!
For question 1: as shown in this figure, open-vocabulary segmentation uses the output of the new path with argmax (feature surgery is not applied there, because the redundant feature is a common bias that does not affect the argmax). Open-vocabulary multi-label recognition uses feature surgery as post-processing, so it can also be applied to other fine-tuned classification methods (using their features with our feature surgery). Multimodal visualization uses the same image tokens as the explainability task, together with the text tokens before max pooling.
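A rough sketch of how I read the feature surgery post-processing step (subtracting the class-agnostic, redundant component before summing into similarities); this is a simplified reading, not the repository's exact implementation:

```python
import torch

def feature_surgery(image_feats, text_feats):
    """Simplified sketch of feature surgery as post-processing.
    image_feats: (N, D) L2-normalized image token features
    text_feats:  (C, D) L2-normalized class text embeddings
    Returns adjusted similarities of shape (N, C)."""
    # Element-wise products between every token and every class embedding.
    feats = image_feats[:, None, :] * text_feats[None, :, :]   # (N, C, D)
    # The mean across classes approximates the redundant, class-agnostic bias.
    redundant = feats.mean(dim=1, keepdim=True)                # (N, 1, D)
    # Remove it, then sum over channels to recover a cosine-like similarity.
    return (feats - redundant).sum(dim=-1)                     # (N, C)
```

For multi-label recognition, the adjusted per-token similarities can then be pooled over image tokens into image-level class scores.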
For question 2: in fact, we use multiple deeper self-attention layers instead of only the last one, and simply skip all FFNs. It might work even better to select blocks ranked by cosine similarity.
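A sketch of that dual-path idea under the stated assumptions (new path only on the deeper blocks, FFN skipped on the new path); `blocks`, `ln_1`, and `attn` follow CLIP-ViT naming but are illustrative here, and `vv_attention` refers to the sketch above:

```python
def dual_path_forward(x, blocks, n_deep=6):
    """Illustrative dual-path forward: the original path runs unchanged,
    while deeper blocks accumulate a second path that uses v-v attention
    and skips the FFN. x: (batch, seq, dim) batch-first tokens."""
    x_new = x
    for i, blk in enumerate(blocks):
        if i >= len(blocks) - n_deep:
            # New path: v-v attention on the original path's tokens, no FFN,
            # accumulated as a residual on the new path only.
            x_new = x_new + vv_attention(blk.ln_1(x), blk.attn)
        # Original path is left untouched (q-k attention + FFN).
        x = blk(x)
    return x, x_new
```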
That answered my questions! Thank you
Hi! Thank you for this good work and the neat implementation. Have you tried training/fine-tuning CLIP_Surgery on out-of-domain datasets (medical scans, drawings, etc.)? Do you think that would improve the mIoU on these datasets, or would the model collapse?