Questions about open-vocabulary semantic segmentation

Hi, thanks for your great work.

I am interested in the details of the open-vocabulary segmentation task and have a few questions about it.

In the architecture surgery, does the segmentation prediction come from the original path or the new path? Also, which features are used in the feature surgery? The paper says "Note that Eq. 9 is specifically designed for the explainability task", but shouldn't segmentation use it as well?
I am also confused by this line in the [code](https://github.com/xmed-lab/CLIP_Surgery/blob/e346359d67e8fc4fe301467914151316d3982661/clip/clip_surgery_model.py#L349C36-L349C36):
```
x[0, :, :] = x_ori[0, :, :] # clip_surgery
```
Why do you keep the [cls] token from the original path? If my understanding is right, the [cls] token in the original path is not influenced by the new path. If so, wouldn't the architecture surgery have no effect on the multi-label recognition task?
Could you give more details? It would also be a great help if you could release the code for open-vocabulary segmentation.
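
To make my reading of the [cls] line concrete, here is a toy sketch in plain Python. It is not the repository's code: `merge_paths` and the tiny nested lists are hypothetical stand-ins, and I am assuming the tensors are laid out as [tokens, batch, dim] with token 0 being [cls].

```python
import copy


def merge_paths(x_new, x_ori):
    """Toy stand-in for the line in question (hypothetical helper).

    Keeps the patch tokens from the new path but takes token 0 ([cls])
    from the original path. Inputs are nested lists shaped
    [tokens][batch][dim].
    """
    x = copy.deepcopy(x_new)
    x[0] = copy.deepcopy(x_ori[0])  # mirrors: x[0, :, :] = x_ori[0, :, :]
    return x


# Made-up values: 3 tokens ([cls] + 2 patches), batch size 1, dim 2.
x_ori = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]

# Two very different new-path outputs for the same original-path output.
x_new_a = [[[10.0, 20.0]], [[30.0, 40.0]], [[50.0, 60.0]]]
x_new_b = [[[-1.0, -2.0]], [[-3.0, -4.0]], [[-5.0, -6.0]]]

out_a = merge_paths(x_new_a, x_ori)
out_b = merge_paths(x_new_b, x_ori)

# The [cls] token is the same whatever the new path produced ...
print(out_a[0] == out_b[0] == x_ori[0])  # True
# ... while the patch tokens still come from the new path.
print(out_a[1:] == x_new_a[1:])  # True
```

If this matches what the real code does, any task that reads only the [cls] token would see exactly the original CLIP output, which is why I am asking about multi-label recognition.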

Thanks again for your work!
