Note that the demo gives a label set, and CLIP Surgery computes the redundant features from that label set.
I guess you are using a sentence without a label set. In that case, you should follow the example for a single text:
```python
texts = ['shoes']
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, texts, device)
    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)
    # apply feature surgery for single text
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])
```
This example uses the empty string as the redundant feature, and you can then use the similarity map to get points for SAM (a sketch of that step is below).
Your attempt also shows the necessity of feature surgery; without it, the noise leads to false points.
As for multiple words, I will try that.
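For reference, the points step could look like this. This is only a minimal sketch, assuming `predictor` is a `SamPredictor` already set on the same image, `cv2_img` is the original image, and `similarity` is the output of `clip.clip_feature_surgery` above:

```python
# sketch: feed the single-text similarity into SAM as point prompts
# (assumes predictor = SamPredictor(sam); predictor.set_image(...) was done beforehand)
points, labels = clip.similarity_map_to_points(similarity[0, 1:, 0], cv2_img.shape[:2], t=0.8)
masks, scores, logits = predictor.predict(point_labels=labels,
                                          point_coords=np.array(points),
                                          multimask_output=True)
mask = masks[np.argmax(scores)].astype('uint8')  # keep the highest-scoring mask
```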
@Eli-YiLi
Thank you for your reply. I followed your code and got new results, which are still not as good as yours.

```python
all_texts = ['a person in the bench', 'person', 'bench']
target_texts = ['a person in the bench', 'person', 'bench']

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
predictor.set_image(np.array(pil_img))

with torch.no_grad():
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    # prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)
    # extract redundant features from an empty string
    redundant_features = clip.encode_text_with_prompt_ensemble(model, [""], device)
    # apply feature surgery
    similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)[0]
    # inference SAM with points from CLIP Surgery
    for n in range(similarity.shape[-1]):
        if all_texts[n] not in target_texts:
            continue
        print('similarity', similarity.shape)
        points, labels = clip.similarity_map_to_points(similarity[1:, n], cv2_img.shape[:2], t=0.8)
        masks, scores, logits = predictor.predict(point_labels=labels, point_coords=np.array(points), multimask_output=True)
        mask = masks[np.argmax(scores)]
        mask = mask.astype('uint8')
```
Result images: text1: person; text2: a person on the bench; text3: bench
I updated the demo.ipynb: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb. The last part covers your case.
For a whole sentence, some salient words may take the lead and overshadow the rest (you can draw the similarity map to check; see the sketch below). So my suggestion is to treat each text individually, or to adjust the threshold for picking points (e.g. 0.8 -> 0.7).
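For instance, a quick sketch of drawing the per-text maps and loosening the threshold. This is only my illustration, reusing the variable names from your snippet (`image_features`, `text_features`, `redundant_features`, `all_texts`, `cv2_img`):

```python
import matplotlib.pyplot as plt

# visualize the per-text similarity maps to see which words dominate
similarity = clip.clip_feature_surgery(image_features, text_features, redundant_features)
similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])  # [B, H, W, num_texts]
for n, text in enumerate(all_texts):
    plt.imshow(similarity_map[0, :, :, n].cpu().numpy(), cmap='jet')
    plt.title(text)
    plt.axis('off')
    plt.show()
    # loosen the point-picking threshold, e.g. 0.8 -> 0.7
    points, labels = clip.similarity_map_to_points(similarity[0, 1:, n], cv2_img.shape[:2], t=0.7)
```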
As for question 3, there is no training; the method is training-free, as described in the paper.
Could you help me find my errors?
Thanks a lot.