Closed mengweiwang closed 1 year ago
Hello, sorry to bother you. I'd like to ask: have you tested the model metrics using the weights provided by the author (bash test_iu_xray.sh)? Did you run into the state_dict mismatch described in #issue8?
@Liqq1
I haven't run into this problem; I can run it directly. The issue is probably in your debugging or setup.
I have reproduced your papers R2Gen and R2GenCMN using the models and datasets provided by you.
The NLG metrics in my reproduction match those reported in the paper, but I cannot reproduce the CE metrics for either R2Gen or R2GenCMN.
Specifically, my precision matches the paper, but recall falls significantly short for both R2Gen and R2GenCMN.
I suspect that this discrepancy may be due to differences in our calculation methods.
Currently, I compute CE scores only over the positive (1) class, mapping NaN and uncertain (-1) values to 0 during computation.
What is your calculation method? Why am I unable to reproduce your CE metric results as reported in the paper?
Thank you very much, and I look forward to hearing from you.
My calculation results are as follows: Precision: 0.3494 Recall: 0.2348 F1: 0.2411
The results reported in the paper are as follows: Precision: 0.334 Recall: 0.275 F1: 0.278
My calculation code is as follows:
import pandas as pd
import json
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm
import random

# The report labels were generated using CheXbert. The even rows are keys,
# and the odd rows are their corresponding label values.
pred_path = 'test_generate_reports_chexbert.csv'
gt_path = 'test_true_reports_chexbert.csv'

gt_dict = {}
df = pd.read_csv(gt_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0]         # Obtain the key.
    value = df.iloc[i + 1, 1:]  # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    gt_dict[key] = labels_list

pred_dict = {}
df = pd.read_csv(pred_path)
for i in tqdm(range(0, df.shape[0], 2)):
    key = df.iloc[i, 0]         # Obtain the key.
    value = df.iloc[i + 1, 1:]  # Obtain the value.
    labels_list = value.tolist()
    # Replace nan with 0.0.
    labels_list = [0.0 if np.isnan(x) else x for x in labels_list]
    # Replace -1 with 0.0.
    labels_list = [0.0 if x == -1 else x for x in labels_list]
    pred_dict[key] = labels_list

# Record the true labels and predicted labels.
labels_list, pred_labels_list = [], []
f1_score_list, precision_list, recall_list = [], [], []
# Traverse pred_dict.
for idx in tqdm(pred_dict):
    pred_labels = pred_dict[idx]
    labels = gt_dict[idx]  # Get the true labels.
    # Convert both to floating point numbers.
    labels = list(map(float, labels))
    pred_labels = list(map(float, pred_labels))
    labels_list.append(labels)
    pred_labels_list.append(pred_labels)

# Calculate clinical metrics using the recorded true and predicted labels.
labels = np.asarray(labels_list)
pred_labels = np.asarray(pred_labels_list)
f1_scores_macro = f1_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
precision_scores_macro = precision_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
recall_scores_macro = recall_score(y_true=labels, y_pred=pred_labels, average="macro", zero_division=0)
f1_scores_micro = f1_score(y_true=labels, y_pred=pred_labels, average="micro")
precision_scores_micro = precision_score(y_true=labels, y_pred=pred_labels, average="micro")
recall_scores_micro = recall_score(y_true=labels, y_pred=pred_labels, average="micro")

predictions_info = ""
predictions_info += "Macro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_macro, precision_scores_macro, recall_scores_macro)
predictions_info += "Micro: F1 score: {:.4f}, Precision: {:.4f}, Recall: {:.4f}\n".format(
    f1_scores_micro, precision_scores_micro, recall_scores_micro)
print(predictions_info)
Hello, may I ask which paper you followed for handling the NaN and -1 label values when computing the scores? Looking forward to your reply.
By the way, when computing the final P, R, and F1 scores, did you use the macro or the micro scores?
@Otiss-pang People usually treat NaN as 0 and -1 as 1. For F1, P, and R, some papers report only macro scores, while others report both macro and micro scores.
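A minimal sketch of that common convention (NaN → 0, uncertain -1 → 1) applied before computing macro and micro scores with scikit-learn. The toy label matrices below are made up for illustration (rows = reports, columns = CheXbert observations), not real CheXbert output:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def map_chexbert_labels(labels):
    """Map CheXbert outputs to binary: NaN (blank) -> 0, -1 (uncertain) -> 1."""
    arr = np.asarray(labels, dtype=float)
    arr = np.nan_to_num(arr, nan=0.0)  # blank -> negative
    arr[arr == -1] = 1.0               # uncertain -> positive
    return arr

# Toy ground-truth and predicted label matrices.
y_true = map_chexbert_labels([[1, np.nan, -1], [0, 1, np.nan]])
y_pred = map_chexbert_labels([[1, 0, -1], [0, 0, 0]])

for avg in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: P={p:.4f} R={r:.4f} F1={f:.4f}")
```

Note that mapping -1 to 1 instead of 0 adds positives to both the ground truth and the predictions, which mainly affects recall; that could account for part of the recall gap reported above.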