xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

Questions about the open-vocabulary semantic segmentation. #22

Closed SuleBai closed 1 year ago

SuleBai commented 1 year ago

Hi, thanks for your great work.

I am interested in the details of open-vocabulary segmentation, and I have a few questions about this task.

  1. In the architecture surgery, does the prediction for segmentation come from the original path or the new path? Additionally, which features are used in the feature surgery? The paper says "Note that Eq. 9 is specifically designed for the explainability task", but I would expect segmentation to use it too. Is that correct?

  2. I am also confused by this line in the [code](https://github.com/xmed-lab/CLIP_Surgery/blob/e346359d67e8fc4fe301467914151316d3982661/clip/clip_surgery_model.py#L349C36-L349C36):

    x[0, :, :] = x_ori[0, :, :] # clip_surgery

    Why do you preserve the [cls] token from the original path? If my understanding is right, the [cls] token in the original path is not influenced by the new path. Does that mean the architecture surgery is useless for the multi-label recognition task?

  3. Could you give more details on how the open-vocabulary segmentation is done? It would also be a great help if you could release the code for it.
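To make question 2 concrete, here is a minimal sketch of what I understand that assignment to do. It assumes the usual CLIP token layout of `(sequence, batch, dim)` with the [cls] token at index 0; the NumPy arrays and the shapes below are hypothetical stand-ins for the model's real tensors, not the repository's code.

```python
import numpy as np

# Hypothetical sizes: 1 [cls] token + 4 patch tokens, batch of 1, feature dim 4.
seq_len, batch, dim = 5, 1, 4

# Stand-ins for the two outputs (assumption: real values come from the model).
x_ori = np.zeros((seq_len, batch, dim))  # original path
x = np.ones((seq_len, batch, dim))       # new path

# The line in question: overwrite the new path's [cls] token (index 0)
# with the [cls] token from the original path.
x[0, :, :] = x_ori[0, :, :]

cls_from_original = bool((x[0] == x_ori[0]).all())  # [cls] now equals the original path
patches_from_new = bool((x[1:] == 1.0).all())       # patch tokens are still the new path
```

If this reading is right, anything computed only from `x[0]` never sees the new path, which is the basis of my question about multi-label recognition.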

Thanks again for your work!

Eli-YiLi commented 1 year ago

For questions 1 and 2, you can refer to this table (from the revised manuscript, currently under review):

[image: table from the revised manuscript]

For the open-vocabulary segmentation, you just need to take the argmax over the output of the new path. The evaluation code will be released after acceptance.
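As a rough illustration of the argmax step, here is a minimal sketch. It assumes the new path yields one score per class for each image patch, with the [cls] token already dropped. The function name `segment_from_similarity`, the input layout `(num_patches, num_classes)`, and the use of NumPy in place of PyTorch are all assumptions made for this example; it is not the evaluation code. Any upsampling of the patch grid to the full image resolution is left out.

```python
import numpy as np

def segment_from_similarity(similarity, height, width):
    """Turn per-patch class scores into a label map via argmax.

    similarity: array of shape (num_patches, num_classes), assumed to hold
        the new path's patch-to-text scores (hypothetical layout).
    height, width: patch grid size, with height * width == num_patches.
    """
    num_patches, _ = similarity.shape
    if num_patches != height * width:
        raise ValueError("patch count does not match the grid size")
    # Pick the highest-scoring class for every patch.
    labels = similarity.argmax(axis=-1)
    # Lay the per-patch labels back out on the 2D patch grid.
    return labels.reshape(height, width)

# Toy scores for a 2x2 patch grid and 3 classes (made-up numbers).
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
])
mask = segment_from_similarity(sim, 2, 2)
print(mask)
```

Each entry of `mask` is the index of the class name (text prompt) assigned to that patch.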