zhjohnchan / R2GenCMN

[ACL-2021] The official implementation of Cross-modal Memory Networks for Radiology Report Generation.
Apache License 2.0

The provided model can reproduce the NLG results, but cannot reproduce the CE results. #7

Closed. mengweiwang closed this issue 1 year ago.

mengweiwang commented 1 year ago

I have reproduced your papers R2Gen and R2GenCMN using the models and datasets provided by you.

The NLG metrics achieved in my reproduction are consistent with those reported in the paper, but I cannot achieve the same results for the CE metrics, for either R2Gen or R2GenCMN.

Specifically, in terms of precision, I can match the results of the paper.

However, in terms of recall, neither R2Gen nor R2GenCMN reaches the results of the paper, and the gap is significant.

I suspect that this discrepancy may be due to differences in our calculation methods.

Currently, my approach calculates CE scores only for the positive (1) class, assigning a value of 0 to NaN and uncertain (-1) values during computation.

What is your calculation method? Why am I unable to reproduce your CE metric results as reported in the paper?

Thank you very much and looking forward to hearing from you.

My calculation results are as follows: Precision: 0.3494 Recall: 0.2348 F1: 0.2411

The results reported in the paper are as follows: Precision: 0.334 Recall: 0.275 F1: 0.278

My calculation code is as follows:

import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm

# The report labels were generated using CheXbert. The even rows are keys, and the odd rows are their corresponding label values.
pred_path = 'test_generate_reports_chexbert.csv'
gt_path = 'test_true_reports_chexbert.csv'

gt_dict = {}
df = pd.read_csv(gt_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0] # Obtain the key.
    value = df.iloc[i+1, 1:] # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    gt_dict[key] = labels_list

pred_dict = {}
df = pd.read_csv(pred_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0] # Obtain the key.
    value = df.iloc[i+1, 1:] # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    pred_dict[key] = labels_list

# Record the true labels and predicted labels.
labels_list, pred_labels_list = [], []

# Traverse pred_dict
for idx in tqdm(pred_dict):
    pred_labels = pred_dict[idx]
    # Get the true labels.
    labels = gt_dict[idx]
    # Change labels to floating point numbers.
    labels = list(map(float, labels))
    # Change pred_labels to floating point numbers.
    pred_labels = list(map(float, pred_labels))
    # Record the true labels and predicted labels.
    labels_list.append(labels)
    pred_labels_list.append(pred_labels)

# Calculate the clinical efficacy (CE) metrics from the recorded true and predicted labels.
labels = np.asarray(labels_list)
pred_labels = np.asarray(pred_labels_list)
f1_scores_macro = f1_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
precision_scores_macro = precision_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
recall_scores_macro = recall_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
f1_scores_micro = f1_score(y_true=labels, y_pred=pred_labels, average="micro")
precision_scores_micro = precision_score(y_true=labels, y_pred=pred_labels, average="micro")
recall_scores_micro = recall_score(y_true=labels, y_pred=pred_labels, average="micro")

predictions_info = ""
predictions_info += "Macro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_macro, precision_scores_macro, recall_scores_macro)
predictions_info += "Micro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_micro, precision_scores_micro, recall_scores_micro)

print(predictions_info)
Liqq1 commented 1 year ago

Hello, sorry to bother you. May I ask whether you tested the model metrics with the weights provided by the authors (bash test_iu_xray.sh), and whether you ran into a state_dict mismatch, as described in #issue8?

mengweiwang commented 1 year ago

@Liqq1

Hello, sorry to bother you. May I ask whether you tested the model metrics with the weights provided by the authors (bash test_iu_xray.sh), and whether you ran into a state_dict mismatch, as described in #issue8?

I did not run into that problem; I was able to run it directly. The issue is probably in your setup or debugging.
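If it helps, here is a minimal sketch for diagnosing such a state_dict mismatch (the checkpoint path below is a placeholder, not the repo's actual file name, and the prefix note is an assumption rather than something confirmed for this repo):

import torch

# Placeholder path -- substitute the checkpoint provided by the authors.
checkpoint = torch.load('model_iu_xray.pth', map_location='cpu')
state_dict = checkpoint['state_dict'] if isinstance(checkpoint, dict) and 'state_dict' in checkpoint else checkpoint

# Print parameter names and shapes; comparing them with model.state_dict().keys()
# usually reveals a prefix mismatch (e.g. 'module.' left over from DataParallel),
# which can be stripped before calling model.load_state_dict, or the checkpoint
# can be loaded with strict=False to list the missing/unexpected keys.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))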

Otiss-pang commented 1 year ago

(Quoting the original comment above.)

Hello, I would like to ask which paper you followed for the handling of the NaN and -1 label values when computing the scores. Looking forward to your reply.

Otiss-pang commented 1 year ago

(Quoting the original comment above.)

By the way, when computing the final P, R, and F1 scores, did you use the macro or the micro scores?

mengweiwang commented 1 year ago

@Otiss-pang People usually treat NaN as 0 and -1 as 1. For computing F1, P, and R, some papers report only macro scores, while others report both macro and micro scores.
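A minimal sketch of that convention (assuming gt_dict and pred_dict are built as in the script above but without the NaN/-1 replacement done there; map_chexbert_labels is a hypothetical helper, not the authors' official evaluation code):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def map_chexbert_labels(values):
    # NaN (not mentioned) -> 0, uncertain (-1) -> 1, positive/negative kept as 1/0.
    mapped = []
    for x in values:
        x = float(x)
        if np.isnan(x):
            mapped.append(0.0)
        elif x == -1.0:
            mapped.append(1.0)
        else:
            mapped.append(x)
    return mapped

labels = np.asarray([map_chexbert_labels(gt_dict[k]) for k in pred_dict])
pred_labels = np.asarray([map_chexbert_labels(pred_dict[k]) for k in pred_dict])

# Report both macro and micro scores, since papers differ in which they use.
for avg in ("macro", "micro"):
    p = precision_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    r = recall_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    f = f1_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    print("{}: Precision {:.4f}, Recall {:.4f}, F1 {:.4f}".format(avg, p, r, f))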