Closed: halbielee closed this issue 1 year ago
Thank you for sharing this excellent work!

I am trying to obtain open-vocabulary segmentation masks from the code. I took the argmax of the "similarity_map" in demo.py, but the resulting masks performed poorly. Is there a recommended way to get a segmentation mask?
First, you can obtain a similarity map from the architecture surgery alone, without the feature surgery, and then apply argmax over the class dimension to get per-pixel labels. This performs well on stuff-oriented datasets such as COCO-Stuff, but we find the results on object-oriented datasets such as VOC12 unsatisfactory. The cause is CLIP itself, which does not align local tokens with the text; the same issue affects other methods such as MaskCLIP. After all, there is no fine-tuning to strengthen the segmentation task. We therefore introduce SAM and use its ability to capture local affinity to refine the masks.
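For reference, here is a minimal sketch of the argmax step described above. The helper name `similarity_to_mask`, the `(N, H, W, C)` tensor layout, and the bilinear upsampling step are assumptions for illustration, not the repo's actual API; adjust them to match the shape that demo.py actually produces.

```python
# Minimal sketch (not the repo's exact API): turn a similarity map from the
# architecture surgery into a hard segmentation mask via argmax over classes.
# The (N, H, W, C) layout and the upsampling step are assumptions.
import torch
import torch.nn.functional as F

def similarity_to_mask(similarity_map: torch.Tensor,
                       image_hw: tuple) -> torch.Tensor:
    """similarity_map: (N, H, W, C) patch-text similarities for C class prompts.
    Returns an (N, H_img, W_img) tensor of per-pixel class indices."""
    # Move the class axis to channels and upsample to image resolution,
    # since the similarity map lives on the coarser patch grid.
    logits = similarity_map.permute(0, 3, 1, 2)          # (N, C, H, W)
    logits = F.interpolate(logits, size=image_hw,
                           mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                          # (N, H_img, W_img)
```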