xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

Train/fine-tune CLIP_Surgery #15

Closed AhmedBourouis closed 1 year ago

AhmedBourouis commented 1 year ago

Hi! Thank you for this great work and the neat implementation. Have you tried training/fine-tuning CLIP_Surgery on out-of-domain datasets (medical scans, drawings, etc.)? Do you think that would improve the mIoU on these datasets, or would the model collapse?

AhmedBourouis commented 1 year ago

I'm also confused by this (screenshots of the two models' parameter counts attached):

I expected that CLIP_Surgery would have more parameters, since you added the extra path for v-v attention.

Eli-YiLi commented 1 year ago

For fine-tuning: I want to emphasize that explainability does not require fine-tuning in general. If you want to fine-tune, it's better to freeze the backbone and add new heads for out-of-domain datasets, which will not collapse. Besides, fine-tuning-based open-vocabulary methods may suit your needs, since some of them add local token grounding.
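In case it helps, here is a minimal PyTorch sketch of that setup (not the authors' recipe): the CLIP backbone is frozen and only a small new head is trained on its patch tokens. The `clip_model` interface, the `encode_image` output shape, and the 1x1-conv head are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch: freeze the CLIP backbone and train only a small head on
# its patch tokens for an out-of-domain dataset. `clip_model.encode_image`
# returning [B, 1 + H*W, C] tokens is an assumption.

class FrozenCLIPSegHead(nn.Module):
    def __init__(self, clip_model, embed_dim=512, num_classes=4):
        super().__init__()
        self.backbone = clip_model
        for p in self.backbone.parameters():
            p.requires_grad = False                  # keep CLIP weights fixed
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)  # new trainable head

    def forward(self, images):
        with torch.no_grad():
            tokens = self.backbone.encode_image(images)   # assumed shape: [B, 1 + H*W, C]
        patches = tokens[:, 1:, :]                        # drop the CLS token
        b, n, c = patches.shape
        h = w = int(n ** 0.5)
        feat = patches.transpose(1, 2).reshape(b, c, h, w)
        return self.head(feat)                            # coarse logits; upsample as needed
```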

For parameters: We don't use any new parameters; the v-v attention reuses the original parameters. We just add a new inference path that shares the original weights.
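For intuition, a rough single-head sketch of what "v-v attention from the original parameters" can look like: the block's existing value and output projections are reused, and the value vectors play the role of queries and keys, so no new weights appear. The argument names and single-head shapes are illustrative, not the repo's actual API.

```python
import torch
import torch.nn.functional as F

# Rough single-head sketch of v-v attention: no new weights are introduced;
# the block's existing value projection (w_v, b_v) and output projection
# (w_out, b_out) are reused, with values standing in for queries and keys.

def vv_attention(x, w_v, b_v, w_out, b_out, scale):
    """x: [N, C] tokens; w_v/w_out: the block's existing projections."""
    v = x @ w_v.T + b_v                              # original value projection, reused
    attn = F.softmax((v @ v.T) * scale, dim=-1)      # v-v attention instead of q-k
    return (attn @ v) @ w_out.T + b_out              # original output projection, reused
```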

AhmedBourouis commented 1 year ago

@Eli-YiLi Thank you for your response!

  1. Can you please point me to these open-vocabulary methods?
  2. One more question (correct me if I'm wrong): in Fig. 5 you compute the similarity between text and image features from different blocks of the text and visual encoders, and the highest similarity occurs at the 10th block. Don't you think it makes more sense to use those features as the final features?

[attached: Fig. 5 from the paper]

Eli-YiLi commented 1 year ago

For question 1: [attached: figure from the paper] As shown in this figure, open-vocabulary segmentation uses the output of the new path with argmax (feature surgery is not applied there, because the redundant feature is a common bias that does not affect the argmax). Open-vocabulary multi-label recognition uses feature surgery as post-processing, so it can be applied on top of other fine-tuned classification methods (using their features with our feature surgery). And multimodal visualization uses the same image tokens as the explainability task, together with the text tokens before max pooling.
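A simplified sketch of the feature-surgery idea as described above: the class-common (redundant) component of the element-wise image-text products is estimated as the mean over classes and removed before summing into similarities. The function name and tensor shapes are illustrative, not the repo's exact implementation.

```python
import torch

# Simplified sketch of feature surgery: remove the component that is common
# to all classes (the "redundant feature") before turning the element-wise
# image-text products into similarities.

def feature_surgery(image_tokens, text_features):
    """image_tokens: [B, N, C] L2-normalized patch tokens
       text_features: [K, C]   L2-normalized class embeddings"""
    # element-wise products per (token, class): [B, N, K, C]
    prod = image_tokens.unsqueeze(2) * text_features.unsqueeze(0).unsqueeze(0)
    redundant = prod.mean(dim=2, keepdim=True)       # class-common bias
    sim = (prod - redundant).sum(dim=-1)             # [B, N, K] similarities
    return sim

# For open-vocabulary segmentation, the argmax over classes is taken on the
# plain similarities instead, since the common bias does not change the argmax.
```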

For question 2: In fact, we use multiple deeper self-attention layers instead of only the last one, and simply skip all FFNs. It might work even better to select blocks ranked by cosine similarity.
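A rough sketch of what this dual path over the deeper blocks could look like, assuming the `vv_attention` helper from the earlier sketch and illustrative block attribute names: the original path runs unchanged, while the parallel path accumulates v-v attention outputs from the deeper blocks and skips every FFN.

```python
# Rough sketch of the dual-path idea: the original path keeps its attention
# and FFN, while a parallel path accumulates v-v attention outputs from the
# deeper blocks only and skips the FFNs. Block attribute names (attn, norm1,
# norm2, mlp, w_v, ...) are assumptions for illustration.

def dual_path_forward(x, blocks, start=6):
    x_new = x.clone()                                 # parallel (new) path
    for i, blk in enumerate(blocks):
        x = x + blk.attn(blk.norm1(x))                # original path: attention
        x = x + blk.mlp(blk.norm2(x))                 # original path: FFN
        if i >= start:                                # only the deeper blocks
            vv_out = vv_attention(blk.norm1(x), blk.w_v, blk.b_v,
                                  blk.w_out, blk.b_out, blk.scale)
            x_new = x_new + vv_out                    # new path: no FFN
    return x, x_new
```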

AhmedBourouis commented 1 year ago

That answered my questions! Thank you