xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

SAM struggles to correctly mask small objects #4

Closed ThereforeGames closed 1 year ago

ThereforeGames commented 1 year ago

Hi,

First of all, great work! I have implemented CLIP_Surgery in my project and can confirm that it's better than clipseg at certain tasks.

However, I'm having a hard time getting it to make decent selections of small objects when using SAM. Let me give you an example:

Source image:

CLIP_Surgery selection of "hand" without SAM:

CLIP_Surgery selection of "hand" with SAM (it selected the background?):

clipseg selection of "hand":


Now, SAM outputs three different masks, but the one above was selected as having the highest score via `masks[np.argmax(scores)]`. Looking at the outputs, though, it really should have preferred mask0 in this case:

image

Is this an issue with CLIP_Surgery's implementation of SAM or SAM itself?

Also, even the best SAM mask seems to include a lot of background noise not present in clipseg's output. Is there an easy way to filter that out?
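
For reference, here's roughly the selection code in question, as a sketch against the `segment_anything` API (the checkpoint path, image, and prompt points are placeholders, not my actual inputs):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder inputs for illustration only.
image = np.zeros((512, 512, 3), dtype=np.uint8)  # HxWx3 RGB image
points = np.array([[256, 256]])                  # (x, y) point prompts
labels = np.ones(len(points), dtype=int)         # 1 = foreground

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# multimask_output=True returns 3 candidate masks plus SAM's own
# quality estimate for each. These scores are class-agnostic, so the
# argmax can prefer a mask that is "clean" but semantically wrong.
masks, scores, logits = predictor.predict(
    point_coords=points,
    point_labels=labels,
    multimask_output=True,
)
best = masks[np.argmax(scores)]
```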

Thanks!

Eli-YiLi commented 1 year ago

Thanks for your attention.

We have also noticed this problem: SAM sometimes produces noisy masks like mask0.png. In your case, it seems some of the prompt points fall on the background (you can draw the points to check). My suggestions are listed below:

  1. Use a higher input resolution for dense prediction, so that small objects are better resolved.
  2. Increase the threshold used to convert the similarity map into point prompts; this helps avoid false points (a sketch follows this list).
  3. Change the way you use CLIP Surgery: first get segments from SAM, then compute the average similarity-map score within each segment, and finally keep the segments whose score exceeds a threshold. This helps reduce noise like mask0.png.
  4. Use another strategy such as a cascade: get a rough region first, then crop it and run again.
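
As a concrete sketch of suggestion 2 (a hypothetical helper, not the repo's actual code), assuming a similarity map `sm` normalized to [0, 1]:

```python
import numpy as np

def sim_map_to_points(sm, threshold=0.8, max_points=20):
    """Hypothetical helper: turn a similarity map sm (HxW, normalized
    to [0, 1]) into point prompts for SAM. A higher threshold keeps
    only the most confident pixels, which avoids false points landing
    on the background."""
    ys, xs = np.where(sm > threshold)
    if len(xs) == 0:
        return np.empty((0, 2), dtype=int), np.empty((0,), dtype=int)
    # Keep at most max_points of the qualifying pixels.
    idx = np.random.choice(len(xs), size=min(max_points, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # SAM expects (x, y)
    labels = np.ones(len(points), dtype=int)       # 1 = foreground
    return points, labels
```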

Our core goal is to explain CLIP via visualization, so how you use it strongly influences the final results.

ThereforeGames commented 1 year ago

Thank you for your insight, @Eli-YiLi!

I raised the threshold variable from 0.8 to 0.98 and observed a great improvement in the masks; check it out:

image

That said, it's still selecting mask2.png as having the highest score... would your third suggestion make a difference in this regard? I need to do some research to learn how the scores are actually calculated.

Eli-YiLi commented 1 year ago

SAM has no notion of the class; instead, it matches similar semantics such as skin color, which leads to false masks.

The third suggestion above is worth trying, because the similarity map is aware of the category. One simple idea: compute the mean similarity-map score for every segment from mask0/mask1/mask2, select the mask with the highest score x, and also keep any other segment scoring above x - threshold.
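
A minimal sketch of that rule, assuming boolean masks and a similarity map resized to the mask resolution (`threshold` here is the margin in x - threshold):

```python
import numpy as np

def select_masks(masks, sm, threshold=0.1):
    """Score each boolean HxW mask by its mean value on the similarity
    map sm (HxW, normalized to [0, 1]). Keep the best-scoring mask
    plus any other mask scoring above x - threshold."""
    scores = np.array([sm[m].mean() for m in masks])
    x = scores.max()
    keep = [m for m, s in zip(masks, scores) if s >= x - threshold]
    return keep, scores
```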

Besides, another solution is: first segment everything via SAM without prompts, then use the similarity map (for a single text) or the open-vocabulary segmentation mask (for a label set) to select masks.
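
That pipeline could look roughly like this with `SamAutomaticMaskGenerator` (the checkpoint path, image, and similarity map are placeholders; for a label set you would score each segment against every class's map instead):

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Placeholders: a real image and a CLIP Surgery similarity map resized
# to the image resolution and normalized to [0, 1].
image = np.zeros((512, 512, 3), dtype=np.uint8)
sm = np.zeros((512, 512))

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(sam)
segments = generator.generate(image)  # dicts with a boolean 'segmentation'

# Score every prompt-free segment by its mean similarity; pick the best
# (or keep everything above a threshold for multi-part objects).
scored = sorted(
    ((sm[s["segmentation"]].mean(), i) for i, s in enumerate(segments)),
    reverse=True,
)
best_mask = segments[scored[0][1]]["segmentation"]
```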