Closed ThereforeGames closed 1 year ago
Thanks for your attention.
We have also noticed this problem: SAM sometimes produces noisy masks like mask0.png. In your case, it seems that some of the points fall on the background (you can draw the points to check). My suggestions are listed below:
Our core goal is to explain CLIP via visualization, and how you use it influences the final results a lot.
Thank you for your insight, @Eli-YiLi!
I raised the threshold variable from 0.8 to 0.98 and observed a great improvement in the masks, check it out:
That said, it's still selecting mask2.png as having the highest score... would your third suggestion make a difference in this regard? I need to do some research to learn how the scores are actually calculated.
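To picture why raising the point threshold helps, here is a minimal sketch. It assumes the prompt points are simply the pixels whose normalized similarity reaches the threshold; the helper `points_above_threshold` and the toy similarity map are hypothetical and are not CLIP_Surgery's actual point-sampling code.

```python
import numpy as np

def points_above_threshold(sim_map, t=0.8):
    """Return (row, col) coordinates whose normalized similarity is >= t.

    Hypothetical helper for illustration only.
    """
    # Min-max normalize so the threshold is comparable across images.
    lo, hi = sim_map.min(), sim_map.max()
    norm = (sim_map - lo) / (hi - lo + 1e-8)
    return np.argwhere(norm >= t)

# Toy similarity map: a strong peak on the object and a weaker,
# spurious response somewhere on the background.
sim = np.zeros((8, 8))
sim[2, 2] = 1.0    # object peak
sim[2, 3] = 0.99   # still on the object
sim[6, 6] = 0.85   # background response

loose = points_above_threshold(sim, t=0.8)    # includes the background point
strict = points_above_threshold(sim, t=0.98)  # keeps only the object points
```

With the looser threshold, the background pixel becomes a positive prompt for SAM, which is one plausible way a background mask ends up among the candidates.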
SAM has no notion of the class. Instead, it matches similar low-level semantics such as skin color, which leads to false masks.
The third suggestion above is worth trying, because the similarity map is aware of the category. One simple idea: compute the mean score on the similarity map for every segment from mask0, mask1, and mask2, select the segment with the highest score x, and also keep any other segments that score higher than x - threshold.
Another solution is to first segment everything with SAM without prompts, and then use the similarity map (for a single text) or the open-vocabulary segmentation mask (for a label set) to select masks.
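The selection rule just described can be sketched roughly as follows. This is an illustration under assumptions, not code from the repository: `select_by_similarity` and `margin` are hypothetical names, each candidate mask is treated as a single segment, and `sim_map` is assumed to be a 2-D array with the same height and width as the masks.

```python
import numpy as np

def select_by_similarity(masks, sim_map, margin=0.1):
    """Keep the mask with the highest mean similarity x, plus any mask
    whose mean similarity is higher than x - margin.

    masks:   (N, H, W) boolean array of candidate masks (e.g. SAM's 3 outputs)
    sim_map: (H, W) float array, the text-image similarity map
    """
    masks = np.asarray(masks, dtype=bool)
    # Mean similarity inside each mask; empty masks can never win.
    scores = np.array(
        [sim_map[m].mean() if m.any() else -np.inf for m in masks]
    )
    best = scores.max()
    keep = scores > best - margin
    merged = masks[keep].any(axis=0)  # union of the kept masks
    return merged, scores, keep

# Toy example: the "hand" occupies the top-left 2x2 corner.
sim = np.full((6, 6), 0.1)
sim[:2, :2] = 0.9

mask0 = np.zeros((6, 6), dtype=bool); mask0[:2, :2] = True  # tight hand mask
mask1 = np.zeros((6, 6), dtype=bool); mask1[:2, :3] = True  # hand plus spill
mask2 = np.zeros((6, 6), dtype=bool); mask2[3:, :] = True   # background

merged, scores, keep = select_by_similarity([mask0, mask1, mask2], sim)
```

Unlike `masks[np.argmax(scores)]` with SAM's own scores, this ranking depends on the text prompt, so a large background mask with a high SAM confidence is no longer preferred. Splitting each mask into connected components before scoring would be a natural refinement.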
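For the label-set variant, a rough sketch might look like the following. It assumes the prompt-free segments are already available as boolean arrays (for instance from SAM's automatic mask generator, where each result carries a "segmentation" array) and that the open-vocabulary result is a per-pixel integer label map. The function `pick_segments` and the `min_fraction` parameter are hypothetical.

```python
import numpy as np

def pick_segments(segments, label_map, target, min_fraction=0.5):
    """Union the segments in which `target` covers at least
    `min_fraction` of the pixels.

    segments:  iterable of (H, W) boolean arrays from prompt-free segmentation
    label_map: (H, W) integer array, per-pixel open-vocabulary labels
    target:    integer id of the label to select
    """
    selected = np.zeros(label_map.shape, dtype=bool)
    for seg in segments:
        seg = np.asarray(seg, dtype=bool)
        if not seg.any():
            continue  # skip empty segments
        # Fraction of this segment's pixels that carry the target label.
        fraction = np.mean(label_map[seg] == target)
        if fraction >= min_fraction:
            selected |= seg
    return selected

# Toy example: label 1 ("hand") fills the top-left corner, plus one
# stray noisy pixel in the bottom-right corner.
labels = np.zeros((4, 4), dtype=int)
labels[:2, :2] = 1
labels[3, 3] = 1

seg_a = np.zeros((4, 4), dtype=bool); seg_a[:2, :2] = True  # hand segment
seg_b = np.zeros((4, 4), dtype=bool); seg_b[2:, 2:] = True  # background segment

hand = pick_segments([seg_a, seg_b], labels, target=1)
```

Because whole segments are accepted or rejected, isolated noisy pixels in the label map (like the stray one above) do not leak into the final mask.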
Hi,
First of all, great work! I have implemented CLIP_Surgery in my project and can confirm that it's better than clipseg at certain tasks.
However, I'm having a hard time getting it to make decent selections of small objects when using SAM. Let me give you an example:
Source image:
CLIP_Surgery selection of "hand" without SAM:
CLIP_Surgery selection of "hand" with SAM (it selected the background?):
clipseg selection of "hand":
Now, SAM outputs 3 different masks, but the one above was selected as having the highest score per masks[np.argmax(scores)]. But if I look at the outputs, I can see that it really should have preferred mask0 in this case:
Is this an issue with CLIP_Surgery's implementation of SAM or SAM itself?
Also, even the best SAM mask seems to include a lot of background noise not present in clipseg's output. Is there an easy way to filter that out?
Thanks!