xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

How can I get masks of open-vocabulary semantic segmentation #8

Closed halbielee closed 1 year ago

halbielee commented 1 year ago

Thank you for sharing excellent work!

I am trying to get open-vocabulary segmentation masks from the code. I tried taking the argmax of the "similarity_map" in demo.py, but it showed poor performance.

Is there any way to get a segmentation mask?

Eli-YiLi commented 1 year ago

Firstly, you can get the similarity map from architecture surgery without feature surgery. Then, apply argmax over the class dimension. The method performs well on stuff-related datasets like COCO-Stuff, while results on object-related datasets like VOC12 are unsatisfactory. This phenomenon is owing to CLIP itself, which doesn't align local tokens to text; it also happens with other methods like MaskCLIP. After all, there is no fine-tuning to enhance the segmentation task. Thus, we introduce SAM, using its local-affinity ability to refine the mask.
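As a minimal sketch of the argmax step described above (the similarity map here is random placeholder data, and all shapes and names are illustrative, not the repo's actual API):

```python
import numpy as np

# Hypothetical similarity map from architecture surgery (no feature surgery):
# one score per patch token per text prompt. Shapes are illustrative only.
H, W, num_classes = 7, 7, 3  # e.g. a 7x7 patch grid with 3 class prompts
rng = np.random.default_rng(0)
similarity_map = rng.random((H * W, num_classes))  # [tokens, classes]

# Argmax over the class dimension assigns each token its best-matching
# class; reshaped to the patch grid this is the (coarse) segmentation mask.
mask = similarity_map.argmax(axis=-1).reshape(H, W)

print(mask.shape)  # (7, 7)
```

In practice the coarse mask would then be upsampled to the image resolution and, as noted above, refined with SAM.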