xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

How to control the accuracy of the generated point? #12

Closed tnickMoxuan closed 1 year ago

tnickMoxuan commented 1 year ago

The segmentation works well when the target object exists in the image, but when a random text prompt is given, a high-confidence point is still generated, which causes segmentation errors in the downstream SAM model. How can this be solved? Here is an example: I need to find a bag in the image, but the bag is actually not in the image, yet the generated point still gets a high score. [image]

Eli-YiLi commented 1 year ago

Because I applied a min-max normalization, the scores for each text vary from 0 to 1. Thus, even when the target is not present in the image, there are still high-response points.
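To make the effect concrete, here is a minimal sketch (with made-up scores) of why min-max normalization always produces a maximal point, even for an absent target:

```python
import numpy as np

# Hypothetical per-location similarity scores for a text that is NOT in the
# image: all absolute similarities are low, but min-max normalization still
# stretches them to the full [0, 1] range, so some point always scores 1.0.
scores = np.array([0.08, 0.10, 0.09, 0.12, 0.11])

normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized.max())  # 1.0, regardless of the absolute similarity level
```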

The suggested way is to measure the existence of the target text via the similarity of the cls token:

```python
prob = image_features[:, :1, :] @ text_features.t()
prob = prob.softmax(-1)
```

Then set a threshold to ignore irrelevant texts.
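A self-contained sketch of this check, written with NumPy instead of torch so it runs standalone; the shapes are assumptions mirroring CLIP (`image_features` as `[B, 1+N, D]` with the cls token first, `text_features` as `[T, D]`, both L2-normalized), and the random features and threshold value are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for CLIP features: [B, 1+N, D] image tokens (cls first)
# and [T, D] text embeddings, both L2-normalized.
image_features = rng.standard_normal((1, 197, 512))
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = rng.standard_normal((3, 512))
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Cosine similarity of the cls token against each text, then softmax: [B, 1, T].
prob = image_features[:, :1, :] @ text_features.T
prob = softmax(prob)

threshold = 0.4  # hypothetical value; tune for your task
present = prob[0, 0] > threshold  # texts judged present in the image
```

Texts whose probability falls below the threshold can then be skipped before generating points for SAM.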

tnickMoxuan commented 1 year ago

With `all_texts = ["hat", "shoes"]` for the above image, when calculating:

```python
prob = image_features[:, :1, :] @ text_features.t()
prob = prob.softmax(-1)
```

the values of `prob` are 0.49 and 0.50, which are almost equal. But the point for the 0.50 text is not accurate and leads to a poor segmentation.

Eli-YiLi commented 1 year ago

These two objects both exist in this image, so it's very likely for them to get similar scores. For the open-set case, I suggest not using softmax; setting a threshold, such as 0.2, on the cosine similarity may suit your need.
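A small sketch of the difference, using made-up cosine similarities: softmax forces the prompts to compete for probability mass, while a fixed threshold on the raw similarity judges each prompt independently, so an absent object can be rejected even when other prompts are also weak:

```python
import numpy as np

# Hypothetical cls-token cosine similarities for three prompts,
# e.g. "hat", "shoes", "bag" (where "bag" is absent from the image).
cos_sim = np.array([0.27, 0.24, 0.08])

threshold = 0.2  # the value suggested above
present = cos_sim > threshold
print(present)  # [ True  True False]
```

Only the prompts that pass the threshold would then be forwarded to point generation and SAM.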