Severe Data Inconsistencies in Medical Judge Evaluation Files

zrobertson466920 commented 1 day ago

Issues Identified

Task Count Mismatches
- QWEN: 265 tasks for Q1-Q3
- GPT: 262 tasks for Q1-Q3
- Difference of 3 tasks consistently across questions
Human Response Mismatches
- Q1: 173 mismatches between QWEN-GPT
- Q2: 177 mismatches between QWEN-GPT
- Q3: 186 mismatches between QWEN-GPT
- Example verified: Task 76, Q1
  - QWEN: [-1, -1, 0, -1, 0]
  - GPT: [1, -1, -1, 0, -1]
- Verified misalignment at line 336 in both files
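
A minimal sketch of how the per-question mismatch counts above could be reproduced. The on-disk format of the evaluation files is not shown in this issue, so the layout here (a dict mapping task ID to the list of human ratings for one question) and the name `mismatched_tasks` are assumptions for illustration only.

```python
from typing import Dict, List

Responses = Dict[int, List[int]]  # assumed layout: task ID -> human ratings


def mismatched_tasks(qwen: Responses, gpt: Responses) -> List[int]:
    """Return the shared task IDs whose human ratings differ between files."""
    shared = sorted(set(qwen) & set(gpt))
    return [task_id for task_id in shared if qwen[task_id] != gpt[task_id]]


# Task 76, Q1 uses the values reported above; task 1 is made-up filler.
qwen_q1 = {76: [-1, -1, 0, -1, 0], 1: [1, 0, 1, 1, 0]}
gpt_q1 = {76: [1, -1, -1, 0, -1], 1: [1, 0, 1, 1, 0]}

print(mismatched_tasks(qwen_q1, gpt_q1))  # [76]
```

Only task IDs present in both files are compared, so the 3-task count difference is reported separately from the response mismatches.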

Impact

- Cannot reliably compare model performance with current data
- Over 60% of human responses show inconsistency between files
- Results require significant caveats about data reliability

Root Cause Investigation Needed

Source of task count difference:
- Which 3 tasks are missing from GPT dataset?
- Were tasks filtered differently during collection?
Source of human response mismatches:
- Were responses recorded/processed differently?
- Possible data corruption during file creation?
- Task ID alignment issues?
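
One way to start narrowing down these questions, sketched under the same assumed layout (task ID mapped to a list of human ratings). `missing_task_ids` and `classify_pair` are hypothetical helper names, not functions from the project.

```python
from collections import Counter
from typing import Dict, List, Tuple

Responses = Dict[int, List[int]]  # assumed layout: task ID -> human ratings


def missing_task_ids(qwen: Responses, gpt: Responses) -> Tuple[List[int], List[int]]:
    """Return (IDs only in the QWEN file, IDs only in the GPT file)."""
    return sorted(set(qwen) - set(gpt)), sorted(set(gpt) - set(qwen))


def classify_pair(a: List[int], b: List[int]) -> str:
    """Label one task's ratings as match, reordered, or different values."""
    if a == b:
        return "match"
    if Counter(a) == Counter(b):
        return "reordered"  # same ratings, different annotator order
    return "different values"  # suggests misaligned IDs or altered data


# Task 76, Q1 as reported above.
print(classify_pair([-1, -1, 0, -1, 0], [1, -1, -1, 0, -1]))  # different values
```

For Task 76 the GPT file contains a rating of 1 that never appears in the QWEN file, so that mismatch is not just a reordering of the same annotator responses. If most mismatches fall in the same class, task ID alignment or data corruption looks more likely than a difference in annotator ordering.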

Proposed Solutions

Immediate:
- Create aligned dataset using only matching task IDs
- Document discarded/mismatched entries
- Re-run analysis with cleaned data
Process:
- Add data integrity checks to collection pipeline
- Implement checksums for human response data
- Create standard task ID system across models
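
A rough sketch of the alignment and checksum steps, again assuming each file can be loaded as a dict from task ID to human ratings. Dropping shared tasks whose responses disagree is one possible policy, not necessarily the one the project should adopt, and `align` and `response_checksum` are illustrative names.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Responses = Dict[int, List[int]]  # assumed layout: task ID -> human ratings


def align(qwen: Responses, gpt: Responses) -> Tuple[Responses, Dict[str, List[int]]]:
    """Keep tasks present in both files with identical ratings; report the rest."""
    shared = set(qwen) & set(gpt)
    aligned = {t: qwen[t] for t in sorted(shared) if qwen[t] == gpt[t]}
    report = {
        "only_in_qwen": sorted(set(qwen) - set(gpt)),
        "only_in_gpt": sorted(set(gpt) - set(qwen)),
        "mismatched": sorted(t for t in shared if qwen[t] != gpt[t]),
    }
    return aligned, report


def response_checksum(responses: Responses) -> str:
    """SHA-256 over a canonical JSON encoding, independent of insertion order."""
    canonical = json.dumps(
        {str(task_id): ratings for task_id, ratings in responses.items()},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up toy data, only to show the shape of the output.
qwen = {1: [1, 0], 2: [0, 0], 3: [1, 1]}
gpt = {1: [1, 0], 2: [0, 1]}
aligned, report = align(qwen, gpt)
print(aligned, report)
```

Storing the checksum of the human responses alongside each model's file would let the pipeline confirm that both files were built from the same human data before any comparison is run.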

Next Steps

[ ] Generate full list of mismatched task IDs
[ ] Create validation script for future data collection
[ ] Document correct mapping between existing files
[ ] Consider re-collection of human evaluations

Labels: bug, data-quality, high-priority, needs-investigation

zrobertson466920 commented 1 day ago

Impact on Experimental Results

After implementing strict data alignment, our TVD-MI analysis shows significantly different patterns:

Pre-Alignment Results:

- No clear pattern across questions (Q1-Q3)
- Inconsistent human-LLM agreement trends
- Noisy model-model comparisons

Post-Alignment Results:

- Q1 (correctness) shows the highest overall agreement (0.091, versus 0.049 and 0.035 for the other two questions)
- Human-human agreement is strongest for Q1 (0.142 vs ~0.048)
- Both LLMs show better alignment with humans on Q1/Q2 than on Q3
- Model-model agreement follows a similar pattern (0.044 → 0.003)

This demonstrates how data quality issues can mask genuine patterns in agreement metrics. The original hypothesis about TVD-MI correlating with correctness assessment only became clear after proper data alignment.

Key lesson: When measuring subtle effects in judge agreement, data consistency is crucial. Even small misalignments can significantly change the results.

hansorlee commented 1 day ago

This issue should be resolved in commit d6b06e1.
