zhjohnchan / R2GenCMN

[ACL-2021] The official implementation of Cross-modal Memory Networks for Radiology Report Generation.
Apache License 2.0

The provided model can reproduce the NLG results, but cannot reproduce the CE results. #7

Closed. mengweiwang closed this issue 1 year ago.

mengweiwang commented 1 year ago

I have reproduced your papers R2Gen and R2GenCMN using the models and datasets provided by you.

The NLG metrics achieved in my reproduction are consistent with those reported in the paper, but I cannot achieve the same results for the CE metrics, for either R2Gen or R2GenCMN.

Specifically, in terms of precision, I can match the results of the paper.

However, in terms of recall, neither R2Gen nor R2GenCMN reaches the results of the paper, and the gap is significant.

I suspect that this discrepancy may be due to differences in our calculation methods.

Currently, my approach calculates CE scores only for the positive (1) class, assigning a value of 0 to NaN and uncertain (-1) values during computation.

What is your calculation method? Why am I unable to reproduce your CE metric results as reported in the paper?

Thank you very much and looking forward to hearing from you.

My calculation results are as follows: Precision: 0.3494 Recall: 0.2348 F1: 0.2411

The results reported in the paper are as follows: Precision: 0.334 Recall: 0.275 F1: 0.278

My calculation code is as follows:

import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm

# The report labels were generated using CheXbert. The even rows are keys, and the odd rows are their corresponding label values.
pred_path = 'test_generate_reports_chexbert.csv'
gt_path = 'test_true_reports_chexbert.csv'

gt_dict = {}
df = pd.read_csv(gt_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0] # Obtain the key.
    value = df.iloc[i+1, 1:] # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    gt_dict[key] = labels_list

pred_dict = {}
df = pd.read_csv(pred_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0] # Obtain the key.
    value = df.iloc[i+1, 1:] # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    pred_dict[key] = labels_list

# Record the true labels and predicted labels.
labels_list, pred_labels_list = [], []

# Traverse pred_dict
for idx in tqdm(pred_dict):
    pred_labels = pred_dict[idx]
    # Get the true labels.
    labels = gt_dict[idx]
    # Change labels to floating point numbers.
    labels = list(map(float, labels))
    # Change pred_labels to floating point numbers.
    pred_labels = list(map(float, pred_labels))
    # Record the true labels and predicted labels.
    labels_list.append(labels)
    pred_labels_list.append(pred_labels)

# Calculate the clinical efficacy (CE) metrics from the recorded true and predicted labels.
labels = np.asarray(labels_list)
pred_labels = np.asarray(pred_labels_list)
f1_scores_macro = f1_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
precision_scores_macro = precision_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
recall_scores_macro = recall_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
f1_scores_micro = f1_score(y_true=labels, y_pred=pred_labels, average="micro")
precision_scores_micro = precision_score(y_true=labels, y_pred=pred_labels, average="micro")
recall_scores_micro = recall_score(y_true=labels, y_pred=pred_labels, average="micro")

predictions_info = ""
predictions_info += "Macro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_macro, precision_scores_macro, recall_scores_macro)
predictions_info += "Micro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_micro, precision_scores_micro, recall_scores_micro)

print(predictions_info)
Liqq1 commented 1 year ago

Hello, sorry to bother you. May I ask whether you tested the model metrics with the weights provided by the authors (bash test_iu_xray.sh), and whether you ran into a state_dict mismatch, as described in #issue8?

mengweiwang commented 1 year ago

@Liqq1

Hello, sorry to bother you. May I ask whether you tested the model metrics with the weights provided by the authors (bash test_iu_xray.sh), and whether you ran into a state_dict mismatch, as described in #issue8?

I did not run into that problem; I was able to run it directly. The issue is probably in your setup or debugging.
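If it helps, here is a minimal sketch for diagnosing such a state_dict mismatch (the checkpoint path below is a placeholder, not the repo's actual file name, and the prefix note is an assumption rather than something confirmed for this repo):

import torch

# Placeholder path -- substitute the checkpoint provided by the authors.
checkpoint = torch.load('model_iu_xray.pth', map_location='cpu')
state_dict = checkpoint['state_dict'] if isinstance(checkpoint, dict) and 'state_dict' in checkpoint else checkpoint

# Print parameter names and shapes; comparing them with model.state_dict().keys()
# usually reveals a prefix mismatch (e.g. 'module.' left over from DataParallel),
# which can be stripped before calling model.load_state_dict, or the checkpoint
# can be loaded with strict=False to list the missing/unexpected keys.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))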

Otiss-pang commented 1 year ago

(Quoting the original comment above.)

Hello, I would like to ask which paper you followed for the handling of the NaN and -1 label values when computing the scores. Looking forward to your reply.

Otiss-pang commented 1 year ago

(Quoting the original comment above.)

By the way, when computing the final P, R, and F1 scores, did you use the macro or the micro scores?

mengweiwang commented 1 year ago

@Otiss-pang People usually treat NaN as 0 and -1 as 1. For computing F1, P, and R, some papers report only macro scores, while others report both macro and micro scores.
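A minimal sketch of that convention (assuming gt_dict and pred_dict are built as in the script above but without the NaN/-1 replacement done there; map_chexbert_labels is a hypothetical helper, not the authors' official evaluation code):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def map_chexbert_labels(values):
    # NaN (not mentioned) -> 0, uncertain (-1) -> 1, positive/negative kept as 1/0.
    mapped = []
    for x in values:
        x = float(x)
        if np.isnan(x):
            mapped.append(0.0)
        elif x == -1.0:
            mapped.append(1.0)
        else:
            mapped.append(x)
    return mapped

labels = np.asarray([map_chexbert_labels(gt_dict[k]) for k in pred_dict])
pred_labels = np.asarray([map_chexbert_labels(pred_dict[k]) for k in pred_dict])

# Report both macro and micro scores, since papers differ in which they use.
for avg in ("macro", "micro"):
    p = precision_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    r = recall_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    f = f1_score(y_true=labels, y_pred=pred_labels, average=avg, zero_division=0)
    print("{}: Precision {:.4f}, Recall {:.4f}, F1 {:.4f}".format(avg, p, r, f))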