Closed: wtc9806 closed this issue 4 months ago.
Thanks for the issue. It seems ROUGE-L (with beta=5) is missing. We will add ROUGE-L evaluation to this repository soon.
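For reference, ROUGE-L with beta=5 weights LCS recall much more heavily than LCS precision. Below is a minimal sketch of that calculation on top of the rouge_score package; the exact code we add to metrics.py may differ:

# Sketch of ROUGE-L with beta=5, built on the rouge_score package.
# Only the F-beta weighting (beta=5) is fixed here; the LCS-based
# precision/recall come from rouge_score and may differ slightly from metrics.py.
from rouge_score import rouge_scorer

def rouge_l(reference: str, candidate: str, beta: float = 5.0) -> float:
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    s = scorer.score(reference, candidate)['rougeL']
    p, r = s.precision, s.recall
    if p == 0 and r == 0:
        return 0.0
    # F-beta: recall is weighted beta times as much as precision.
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)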
I used the img_align method in metrics.py to calculate the similarity between the generated image and the real image, but the score I got was 1.28, which is about twice the performance of the PR+ICL configuration reported in the original paper. Is there something wrong with what I wrote? My test code is as follows:

import os
import clip
from tqdm import tqdm
from metrics import img_align, clip_score

def main(args):
    model, processor = clip.load("ViT-L/14", device=args.device)
    result_lst = os.listdir(args.data_path)
    img_align_score = []
    pms_score = []
    rouge_l_score = []
    error_lst = []
    for item in tqdm(result_lst):
        path = os.path.join(args.data_path, item)
        ori_path = os.path.join(path, 'ori.png')
        gen_path = os.path.join(path, 'generated.png')
        text_path = os.path.join(path, 'result.txt')
        if not os.path.exists(ori_path) or not os.path.exists(gen_path):
            error_path = ori_path + ' ' + gen_path
            error_lst.append(error_path)
            continue
        else:
            ori_img = read_img(ori_path)
            gen_img = read_img(gen_path)
            text = read_txt(text_path)
            img_align_score.append(img_align(model, processor, ori_img, gen_img, device=args.device))
            pms_score.append(clip_score(model, processor, gen_img, text['new_prompt'], device=args.device))
    mean_align = sum(img_align_score) / len(img_align_score)
    print('mean_align = ', mean_align)
    with open(args.error_path, "w") as f:
        f.writelines(error_lst)
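(read_img and read_txt are not shown above; for completeness, a minimal sketch of what such helpers might look like, assuming result.txt stores a JSON dict with a 'new_prompt' field:)

# Hypothetical helpers, not part of the repository; the originals were not posted.
import json
from PIL import Image

def read_img(path):
    # Load the image as RGB so CLIP's preprocess can handle it.
    return Image.open(path).convert('RGB')

def read_txt(path):
    # Assumes result.txt stores a JSON dict containing 'new_prompt'.
    with open(path, 'r') as f:
        return json.load(f)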
For the image metric, there are two differences from our evaluation, one of which is caused by a problem in my code in 'metrics.py'. First, the gap is mainly due to a wrong coefficient in the 'img_align' function: the cosine similarity should not be multiplied by 2.5 (removing that factor turns your 1.28 into roughly 0.51). The function should look like this, and we will fix it in the repository soon:
import torch

@torch.no_grad()
def img_align(model, preprocess, ori_image, gen_image, device='cuda'):
    '''
    Implementation of the proposed Image-Align metric, which calculates the
    similarity between the ground truth image and the generated image.
    Image-Align uses the CLIP ViT-B/32 model.
    Args:
        model: CLIP model
        preprocess: CLIP image preprocess
        ori_image: ground truth image
        gen_image: generated image
        device: specify if not 'cuda'
    '''
    ori_image = preprocess(ori_image).unsqueeze(0).to(device)
    gen_image = preprocess(gen_image).unsqueeze(0).to(device)
    ori_features = model.encode_image(ori_image)
    gen_features = model.encode_image(gen_image)
    gen_features /= gen_features.norm(dim=-1, keepdim=True)
    ori_features /= ori_features.norm(dim=-1, keepdim=True)
    score = 1.0 * (ori_features @ gen_features.T)  # was mistakenly multiplied by 2.5
    if score < 0:
        score = 0.0
    return score
Second, you are using CLIP ViT-L/14 as the evaluation model, while we use CLIP ViT-B/32 throughout our evaluation. Generally this is fine as long as you use the same model for all evaluations; it should only cause slight differences.
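To match the paper's setting, the only change needed in the test script above should be the model name passed to the standard OpenAI CLIP loader:

# Load the paper's evaluation backbone (CLIP ViT-B/32) instead of ViT-L/14.
model, processor = clip.load("ViT-B/32", device=args.device)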
Thanks again for your issue. We will fix the 'img_align' function and add the modified ROUGE-L calculation in 'metrics.py' soon.
"metrics.py" is updated.
Hi authors, I found that the metric code is not complete. Could you provide the complete test code?