xmed-lab / CLIP_Surgery

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

About the results of Clip surgery with SAM #2

Closed · seventy7796 closed this 1 year ago

seventy7796 commented 1 year ago
  1. I ran the demo with CLIP "CS-ViT-B/16", and the result for "person" is as follows, with wrong points: (image)

Could you help me find my errors?

  2. Also, I tried "a person on the bench" and other sentences. Do the results not work with a long sentence as input, or with some extra description?
  3. Could you share the training dataset of "CS-ViT-B/16"?

Thanks a lot.

Eli-YiLi commented 1 year ago

Note that the demo uses a label set, and CLIP Surgery computes the redundant features from this label set.

I guess you just used a sentence without a label set. In that case, you should follow the example for a single text:

```python
# CLIP Surgery for a single text, without fixed label sets
texts = ['shoes']

with torch.no_grad():
    # CLIP architecture surgery acts on the image encoder
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)

    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, texts, device)

    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    # apply feature surgery for single text
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])
```

This example uses the empty string as the redundant feature; you can then use the similarity map to get points for SAM.

Your attempt also shows the necessity of feature surgery: without it, the noise leads to false points.
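
As a rough comparison (my own sketch, reusing the variables from the snippet above; the raw baseline below is just a plain dot product and is not part of the demo), you can visualize the similarity map with and without feature surgery to see that noise:

```python
with torch.no_grad():
    # baseline without feature surgery: plain patch-to-text similarity (noisy, gives false points)
    raw_similarity = image_features @ text_features.t()              # [B, 1+N, C]
    raw_map = clip.get_similarity_map(raw_similarity[:, 1:, :], cv2_img.shape[:2])

    # with feature surgery: redundant (empty-string) features are removed first
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
    surgery_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])
```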

As for multiple words, I will try it.

seventy7796 commented 1 year ago

@Eli-YiLi

Thank you for your reply. I followed your code and got new results, which are still not as good as yours.

```python
all_texts = ['a person in the bench', 'person', 'bench']
target_texts = ['a person in the bench', 'person', 'bench']

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
predictor.set_image(np.array(pil_img))

with torch.no_grad():
    # CLIP architecture surgery acts on the image encoder
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)

    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)

    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    # apply feature surgery
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)[0]

    # inference SAM with points from CLIP Surgery
    for n in range(similarity.shape[-1]):
        if all_texts[n] not in target_texts:
            continue
        print('similarity', similarity.shape)
        points, labels = clip.similarity_map_to_points(similarity[1:, n], cv2_img.shape[:2], t=0.8)
        masks, scores, logits = predictor.predict(point_labels=labels, point_coords=np.array(points), multimask_output=True)
        mask = masks[np.argmax(scores)]
        mask = mask.astype('uint8')
```

text1: person (image)

text2: a person on the bench (image)

text3: bench (image)

Eli-YiLi commented 1 year ago

I updated demo.ipynb: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb. The last part covers your case.

(image)

For a whole sentence, some salient words may dominate and the remaining words get overlooked (you can plot the similarity map to check this). So my suggestion is to treat each text individually, or adjust the threshold used to pick points (e.g. 0.8 -> 0.7), as in the sketch below.
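
A rough sketch of that suggestion (not the exact demo.ipynb code; it reuses the model, predictor, and image features from your snippet above): run each phrase through the single-text pipeline separately and lower the point threshold:

```python
with torch.no_grad():
    # redundant features from an empty string, shared across all single-text queries
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)

    for text in ['person', 'bench']:
        # encode one phrase at a time, so no single word dominates the similarity map
        text_features = clip.encode_text_with_prompt_ensemble(model, [text], device)
        similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)[0]

        # lower threshold (0.8 -> 0.7) keeps more candidate points for SAM
        points, labels = clip.similarity_map_to_points(similarity[1:, 0], cv2_img.shape[:2], t=0.7)
        masks, scores, logits = predictor.predict(point_labels=labels,
                                                  point_coords=np.array(points),
                                                  multimask_output=True)
        mask = masks[np.argmax(scores)].astype('uint8')  # best-scoring mask for this phrase
```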

xiexie123 commented 1 year ago

For question 3: there is no training; CLIP Surgery is training-free, please refer to the paper.