xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

About the results of Clip surgery with SAM #2

Closed · seventy7796 closed this 1 year ago

seventy7796 commented 1 year ago
  1. I ran the demo with CLIP "CS-ViT-B/16", and the result for "person" is as follows, with wrong points: (image)

Could you help me find my errors?

  2. Also, I tried "a person on the bench" and other sentences. Do the results not work with a long sentence as input, or with some extra description?
  3. Could you share the training dataset of "CS-ViT-B/16"?

Thanks a lot.

Eli-YiLi commented 1 year ago

Note that the demo uses a label set, and CLIP Surgery computes the redundant features from this label set.

I guess you just used a sentence without a label set. In that case, you should follow the example for a single text:

```python
# CLIP Surgery for a single text, without fixed label sets
texts = ['shoes']

with torch.no_grad():
    # CLIP architecture surgery acts on the image encoder
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)

    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, texts, device)

    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    # apply feature surgery for single text
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])
```

This example uses the empty string as the redundant feature; you can then use the similarity map to get points for SAM.

Your attempt also shows the necessity of feature surgery: without it, the noise leads to false points.
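
As a rough comparison (my own sketch, reusing the variables from the snippet above; the raw baseline below is just a plain dot product and is not part of the demo), you can visualize the similarity map with and without feature surgery to see that noise:

```python
with torch.no_grad():
    # baseline without feature surgery: plain patch-to-text similarity (noisy, gives false points)
    raw_similarity = image_features @ text_features.t()              # [B, 1+N, C]
    raw_map = clip.get_similarity_map(raw_similarity[:, 1:, :], cv2_img.shape[:2])

    # with feature surgery: redundant (empty-string) features are removed first
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
    surgery_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])
```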

As for multiple words, I will try it.

seventy7796 commented 1 year ago

@Eli-YiLi

Thank you for your reply. I followed your code and got new results, which are still not as good as yours.

```python
all_texts = ['a person in the bench', 'person', 'bench']
target_texts = ['a person in the bench', 'person', 'bench']

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
predictor.set_image(np.array(pil_img))

with torch.no_grad():
    # CLIP architecture surgery acts on the image encoder
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)

    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)

    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    # apply feature surgery
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)[0]

    # inference SAM with points from CLIP Surgery
    for n in range(similarity.shape[-1]):
        if all_texts[n] not in target_texts:
            continue
        print('similarity', similarity.shape)
        points, labels = clip.similarity_map_to_points(similarity[1:, n], cv2_img.shape[:2], t=0.8)
        masks, scores, logits = predictor.predict(point_labels=labels, point_coords=np.array(points), multimask_output=True)
        mask = masks[np.argmax(scores)]
        mask = mask.astype('uint8')
```

text1: person (image)

text2: a person on the bench (image)

text3: bench (image)

Eli-YiLi commented 1 year ago

I updated demo.ipynb: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb. The last part covers your case.

(image)

For a whole sentence, some salient words may dominate and the remaining words get overlooked (you can plot the similarity map to check this). So my suggestion is to treat each text individually, or adjust the threshold used to pick points (e.g. 0.8 -> 0.7), as in the sketch below.
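
A rough sketch of that suggestion (not the exact demo.ipynb code; it reuses the model, predictor, and image features from your snippet above): run each phrase through the single-text pipeline separately and lower the point threshold:

```python
with torch.no_grad():
    # redundant features from an empty string, shared across all single-text queries
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    for text in ['person', 'bench']:
        # encode one phrase at a time, so no single word dominates the similarity map
        text_features = clip.encode_text_with_prompt_ensemble(model, [text], device)
        similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)[0]

        # lower threshold (0.8 -> 0.7) keeps more candidate points for SAM
        points, labels = clip.similarity_map_to_points(similarity[1:, 0], cv2_img.shape[:2], t=0.7)
        masks, scores, logits = predictor.predict(point_labels=labels,
                                                  point_coords=np.array(points),
                                                  multimask_output=True)
        mask = masks[np.argmax(scores)].astype('uint8')  # best-scoring mask for this phrase
```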

xiexie123 commented 1 year ago

For question 3: there is no training; CLIP Surgery is training-free, please refer to the paper.