yfzhang114 / LLaVA-Align

This is the official repo for Debiasing Large Visual Language Models, including a Post-Hoc debias method and Visual Debias Decoding strategy.
Apache License 2.0

Problem with align results on a binary classification dataset #4

Closed veinhao closed 6 months ago

veinhao commented 6 months ago

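Hello, I am running experiments on another binary classification dataset. The binary-classification script eval_pope_calibrate.py in this repo matches the ALIGN setting for my dataset, so I did the following:

  1. Modified experiments/scripts/pope/run_llava.sh to fit my experimental setup and obtained the result jsonl.
  2. Modified args.gt_files and args.gen_files in eval_pope_calibrate.py and obtained the output values.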

My questions are:

    {"question_id": 5, "image": "5_bin.jpg", "text": "does abdomen show hemorrhage secondary to ruptured aneurysm?", "category": "conv", "label": "no"}
    {"question_id": 11, "image": "11_bin.jpg", "text": "does omphalocele show a photo taken during life large lesion?", "category": "conv", "label": "no"}
    {"question_id": 14, "image": "14_bin.jpg", "text": "is the entire thickness of the epithelium characterize by a predominantly lymphocytic infiltrate?", "category": "conv", "label": "no"}
    {"question_id": 15, "image": "15_bin.jpg", "text": "is omentum present?", "category": "conv", "label": "no"}
    {"question_id": 28, "image": "28_bin.jpg", "text": "is an opened peritoneal cavity cause by fibrous band strangulation present?", "category": "conv", "label": "no"}

The generation results before align are:

My problem is that the post-ALIGN metrics such as Precision and Recall depend directly on the GT labels. When the labels of all my questions are no, the output after align is:

    ****split popular****
    Evaluate the performance in naive setting
    F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
    Evaluate the performance in none setting
    F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
    Evaluate the performance in unk setting
    F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence nan
    Evaluate the performance in none_unk setting
    F1: 0.0 Accuracy: 0.0 Precision: 0.0 Recall: 0.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0

When the GT labels are all yes, Accuracy, Precision and Recall after align all show 100:

    ****split popular****
    Evaluate the performance in naive setting
    F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
    Evaluate the performance in none setting
    F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0
    Evaluate the performance in unk setting
    F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence nan
    Evaluate the performance in none_unk setting
    F1: 100.0 Accuracy: 100.0 Precision: 100.0 Recall: 100.0 yes: 100.0 unknow: 0.0 number questions 5 confidence 0.0

I don't think the effect of align should be determined by whether the labels are yes or no.
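For reference, here is a minimal sketch of how standard binary-classification metrics behave when every prediction is yes. It assumes the `yes: 100.0` field in both logs is the share of yes predictions. This is not the code in eval_pope_calibrate.py: the function name `binary_metrics`, the choice of yes as the positive class, and the convention of reporting 0 when a denominator is zero are all assumptions made for illustration.

```python
def binary_metrics(labels, preds, positive="yes"):
    """Standard binary-classification metrics, returned in percent.

    Hypothetical helper for illustration; not taken from the repo.
    """
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    tn = sum(l != positive and p != positive for l, p in pairs)

    # Assumed convention: a ratio with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    yes_ratio = sum(p == positive for p in preds) / len(preds) if preds else 0.0

    return {
        "f1": 100 * f1,
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "yes": 100 * yes_ratio,
    }


# Five predictions that are all yes, scored against the two label sets.
all_yes_preds = ["yes"] * 5
print(binary_metrics(["no"] * 5, all_yes_preds))   # labels all no
print(binary_metrics(["yes"] * 5, all_yes_preds))  # labels all yes
```

Under these assumptions, the same five yes predictions score 0 on every metric against all-no labels and 100 against all-yes labels, which matches the two logs. If that reading is right, the gap between the two runs would come from scoring a constant yes output against single-class labels, so a label set containing both yes and no would be needed to tell whether align changes the predictions.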