Closed: halbielee closed this issue 1 year ago
Thank you for sharing this excellent work!

I am trying to obtain open-vocabulary segmentation masks from the code. I took the argmax of the "similarity_map" in demo.py, but the resulting masks performed poorly. Is there a recommended way to get a segmentation mask?
First, you can obtain a similarity map from the architecture surgery alone, without the feature surgery, and then apply argmax over the class dimension to get per-pixel labels. This performs well on stuff-oriented datasets such as COCO-Stuff, but we find the results on object-oriented datasets such as VOC12 unsatisfactory. The cause is CLIP itself, which does not align local tokens with the text; the same issue affects other methods such as MaskCLIP. After all, there is no fine-tuning to strengthen the segmentation task. We therefore introduce SAM and use its ability to capture local affinity to refine the masks.
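For reference, here is a minimal sketch of the argmax step described above. The helper name `similarity_to_mask`, the `(N, H, W, C)` tensor layout, and the bilinear upsampling step are assumptions for illustration, not the repo's actual API; adjust them to match the shape that demo.py actually produces.

```python
# Minimal sketch (not the repo's exact API): turn a similarity map from the
# architecture surgery into a hard segmentation mask via argmax over classes.
# The (N, H, W, C) layout and the upsampling step are assumptions.
import torch
import torch.nn.functional as F

def similarity_to_mask(similarity_map: torch.Tensor,
                       image_hw: tuple) -> torch.Tensor:
    """similarity_map: (N, H, W, C) patch-text similarities for C class prompts.
    Returns an (N, H_img, W_img) tensor of per-pixel class indices."""
    # Move the class axis to channels and upsample to image resolution,
    # since the similarity map lives on the coarser patch grid.
    logits = similarity_map.permute(0, 3, 1, 2)          # (N, C, H, W)
    logits = F.interpolate(logits, size=image_hw,
                           mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                          # (N, H_img, W_img)
```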