Closed tnickMoxuan closed 1 year ago
Because I applied min-max normalization, which rescales the scores to the range [0, 1] for each text, there are still high-response points even when the target is not present in the image.
The suggested way is to measure whether the target text is present via the similarity of the CLS token:

prob = image_features[:, :1, :] @ text_features.t()
prob = prob.softmax(-1)

Then set a threshold to ignore irrelevant texts.
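A minimal sketch of that filtering step (the tensor shapes, the helper name, and the 0.5 threshold are my assumptions for illustration, not values from the repo):

```python
import torch

def keep_present_texts(image_features, text_features, threshold=0.5):
    """Score each text prompt against the image's CLS token and keep
    only prompts whose softmax probability clears the threshold.

    image_features: (B, N, C) tokens, CLS token first (assumed layout)
    text_features:  (T, C), one embedding per text prompt
    """
    # CLS-token similarity against every text embedding -> (B, 1, T)
    prob = image_features[:, :1, :] @ text_features.t()
    prob = prob.softmax(-1)
    # Boolean mask over the texts for the first image in the batch
    return prob[0, 0] > threshold

# Toy example: the first text embedding is aligned with the CLS token,
# the second is orthogonal, so only the first survives the threshold.
img = torch.zeros(1, 5, 4)
img[0, 0] = torch.tensor([1.0, 0.0, 0.0, 0.0])  # CLS token
texts = torch.tensor([[5.0, 0.0, 0.0, 0.0],
                      [0.0, 5.0, 0.0, 0.0]])
print(keep_present_texts(img, texts))  # tensor([ True, False])
```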
all_texts = ["hat", 'shoes'] for the above image. When I calculate:

prob = image_features[:, :1, :] @ text_features.t()
prob = prob.softmax(-1)

the values of prob are 0.49 and 0.50, which are almost equal. But the point at 0.50 is not accurate and leads to poor segmentation.
These two objects both exist in this image, so similar scores are expected. For the open-set case, I suggest not using softmax; setting a threshold like 0.2 on the cosine similarity may suit your need.
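A sketch of that open-set variant, assuming the features should be L2-normalized before comparison (the function name and shapes are again my assumptions; the 0.2 threshold comes from the comment above):

```python
import torch
import torch.nn.functional as F

def present_texts_open_set(image_features, text_features, threshold=0.2):
    """Open-set presence check: cosine similarity between the CLS token
    and each text embedding, thresholded directly (no softmax over
    prompts, so an absent object cannot "steal" probability mass)."""
    cls_tok = F.normalize(image_features[:, 0, :], dim=-1)  # (B, C)
    txt = F.normalize(text_features, dim=-1)                # (T, C)
    sim = cls_tok @ txt.t()                                 # (B, T) cosine sims
    return sim > threshold

# A prompt for an absent object scores near zero and is rejected,
# whereas softmax over just two prompts would have given it ~0.5.
img = torch.zeros(1, 5, 4)
img[0, 0] = torch.tensor([1.0, 0.0, 0.0, 0.0])  # CLS token
texts = torch.tensor([[0.9, 0.1, 0.0, 0.0],     # present object
                      [0.0, 0.0, 1.0, 0.0]])    # absent object
print(present_texts_open_set(img, texts))  # tensor([[ True, False]])
```

Because each text is judged independently against the image, adding or removing prompts no longer changes the score of the others, which is what you want for open-set queries.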
The segmentation works well when the queried object exists in the image, but when a random text prompt is given, a high-confidence point is still generated, which causes segmentation errors in the downstream SAM model. How can this situation be solved? Here is a case: I need to find a bag in the image, but the bag is actually not in the image, yet the generated point still has a high score.