Hi, I'd like to ask how the CLIP results in Figure 1 of the paper were obtained. The CLIP scores I compute are the opposite of yours: I get [0.63, 0.37] and [0.539, 0.461], while the paper reports [0.401, 0.599] and [0.289, 0.711], so the conclusion seems exactly reversed. Here is my test code:
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("dress.png")).unsqueeze(0).to(device)
text = clip.tokenize(["A blue dress and a red book","A red dress and a blue book"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # compute image-text logits and softmax over the two prompts
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)