Issue with ALIGN results on a binary classification dataset

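Hello, I am running experiments on a different binary classification dataset. The binary-classification evaluation script in the code, eval_pope_calibrate.py, fits the ALIGN setting of my dataset, so I took the following steps: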

1. Modified experiments/scripts/pope/run_llava.sh for my experimental setup and obtained the result jsonl file.
2. Modified args.gt_files and args.gen_files in eval_pope_calibrate.py and obtained the output metrics.
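For the metrics to mean anything, each label in args.gt_files has to correspond to the generated answer for the same question in args.gen_files. The sketch below is one way to join the two files on question_id so that missing or misaligned entries become visible. It is not the script's own code: the helper names load_jsonl and pair_by_question_id are hypothetical, and the assumption that the generation file stores the answer under a `text` key may not match your output.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_by_question_id(gt_records, gen_records, answer_key="text"):
    """Join ground-truth labels with generated answers on question_id.

    answer_key is an ASSUMPTION about the generation file's field name.
    Returns (pairs, missing): pairs is a list of (question_id, label, answer)
    tuples; missing lists question_ids that have a label but no answer.
    """
    answers = {r["question_id"]: r[answer_key] for r in gen_records}
    pairs, missing = [], []
    for r in gt_records:
        qid = r["question_id"]
        if qid in answers:
            pairs.append((qid, r["label"], answers[qid]))
        else:
            missing.append(qid)
    return pairs, missing


# Hypothetical usage with in-memory records instead of files.
gt = [{"question_id": 5, "label": "no"}, {"question_id": 11, "label": "no"}]
gen = [{"question_id": 5, "text": "Yes"}]
print(pair_by_question_id(gt, gen))  # ([(5, 'no', 'Yes')], [11])
```

With real files, `pair_by_question_id(load_jsonl(gt_path), load_jsonl(gen_path))` would do the same; a non-empty `missing` list suggests the two files do not cover the same questions.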

My questions are:
{"question_id": 5, "image": "5_bin.jpg", "text": "does abdomen show hemorrhage secondary to ruptured aneurysm?", "category": "conv", "label": "no"}
{"question_id": 11, "image": "11_bin.jpg", "text": "does omphalocele show a photo taken during life large lesion?", "category": "conv", "label": "no"}
{"question_id": 14, "image": "14_bin.jpg", "text": "is the entire thickness of the epithelium characterize by a predominantly lymphocytic infiltrate?", "category": "conv", "label": "no"}
{"question_id": 15, "image": "15_bin.jpg", "text": "is omentum present?", "category": "conv", "label": "no"}
{"question_id": 28, "image": "28_bin.jpg", "text": "is an opened peritoneal cavity cause by fibrous band strangulation present?", "category": "conv", "label": "no"}
The generation results before ALIGN are:
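All five sample questions are labeled no, so it can help to check the label balance of the whole question file before reading the metrics. A minimal sketch, assuming one JSON object per line with a `label` field as in the records above; label_distribution is a hypothetical helper, not part of LLaVA-Align.

```python
import json
from collections import Counter


def label_distribution(lines):
    """Count normalized labels in POPE-style question records (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        if line.strip():  # ignore blank lines
            counts[json.loads(line)["label"].strip().lower()] += 1
    return counts


# The five sample records above, reduced to the fields this check needs.
sample = [json.dumps({"question_id": qid, "label": "no"}) for qid in (5, 11, 14, 15, 28)]
print(label_distribution(sample))  # Counter({'no': 5})
```

Because the function accepts any iterable of lines, an open file handle for the question jsonl can be passed in directly.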

My problem is that the post-ALIGN metrics such as Precision and Recall depend directly on the GT labels. When every label in my question file is no, the output after ALIGN is:

****split popular****
Evaluate the performance in naive setting
F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
Evaluate the performance in none setting
F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
Evaluate the performance in unk setting
F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence nan
Evaluate the performance in none_unk setting
F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0

When every GT label is yes, Accuracy, Precision, and Recall after ALIGN are all 100:

****split popular****
Evaluate the performance in naive setting
F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
Evaluate the performance in none setting
F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
Evaluate the performance in unk setting
F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence nan
Evaluate the performance in none_unk setting
F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
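To see how the two logs above can arise, here is a sketch of POPE-style binary metrics with yes as the positive class, reported in percent. Both logs show yes: 100.0, i.e. every prediction was yes, so the sketch feeds identical all-yes predictions against all-no and all-yes labels. This is not the code in eval_pope_calibrate.py; binary_metrics is a hypothetical helper, and returning 0.0 on a zero denominator is an assumption chosen because it reproduces the 0.0 values in the first log.

```python
def binary_metrics(labels, preds):
    """POPE-style metrics in percent, treating "yes" as the positive class.

    ASSUMPTION: a zero denominator yields 0.0 instead of raising.
    """
    pairs = list(zip(labels, preds))
    tp = sum(l == "yes" and p == "yes" for l, p in pairs)
    fp = sum(l == "no" and p == "yes" for l, p in pairs)
    tn = sum(l == "no" and p == "no" for l, p in pairs)
    fn = sum(l == "yes" and p == "no" for l, p in pairs)

    def ratio(num, den):
        return 100.0 * num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "F1": f1,
        "Accuracy": ratio(tp + tn, len(pairs)),
        "Precision": precision,
        "Recall": recall,
        "yes": ratio(sum(p == "yes" for p in preds), len(preds)),
    }


preds = ["yes"] * 5  # every answer is "yes", as in both logs
print(binary_metrics(["no"] * 5, preds))
# {'F1': 0.0, 'Accuracy': 0.0, 'Precision': 0.0, 'Recall': 0.0, 'yes': 100.0}
print(binary_metrics(["yes"] * 5, preds))
# {'F1': 100.0, 'Accuracy': 100.0, 'Precision': 100.0, 'Recall': 100.0, 'yes': 100.0}
```

Under these definitions the predictions are the same in both calls, and only the labels differ between the two results.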

I don't think the effect of ALIGN should depend on whether the labels are yes or no.

yfzhang114 / LLaVA-Align

Issue with ALIGN results on a binary classification dataset #4