zrobertson466920 / CS329_Project


Severe Data Inconsistencies in Medical Judge Evaluation Files #2

Open zrobertson466920 opened 1 day ago

zrobertson466920 commented 1 day ago

Issues Identified

  1. Task Count Mismatches

    • QWEN: 265 tasks for Q1-Q3
    • GPT: 262 tasks for Q1-Q3
    • The QWEN file has 3 more tasks than the GPT file, and the gap is the same for every question
  2. Human Response Mismatches

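    • Q1: 173 tasks where the human responses differ between the QWEN and GPT files
    • Q2: 177 tasks where the human responses differ between the QWEN and GPT files
    • Q3: 186 tasks where the human responses differ between the QWEN and GPT files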
    • Example verified: Task 76, Q1
      • QWEN: [-1, -1, 0, -1, 0]
      • GPT: [1, -1, -1, 0, -1]
    • Verified misalignment at line 336 in both files
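A check like the one behind the mismatch counts above could be sketched as follows. This is a minimal illustration, not the script used for this report: the in-memory layout (`{task_id: {"Q1": [...]}}`) and the function name `count_human_mismatches` are assumptions, and only the Task 76 values come from the example above.

```python
from typing import Dict, List

# Hypothetical layout: {task_id: {"Q1": [human ratings], "Q2": [...], ...}}
Responses = Dict[int, Dict[str, List[int]]]


def count_human_mismatches(qwen: Responses, gpt: Responses, question: str) -> List[int]:
    """Return the sorted task IDs whose human ratings differ between the two files.

    Only tasks present in both files are compared; tasks missing from one
    side are a separate problem (the task count mismatch).
    """
    shared = sorted(set(qwen) & set(gpt))
    return [t for t in shared if qwen[t].get(question) != gpt[t].get(question)]


# Toy data. Task 76 uses the values reported above; task 77 is made up.
qwen = {76: {"Q1": [-1, -1, 0, -1, 0]}, 77: {"Q1": [1, 1, 0, 1, 1]}}
gpt = {76: {"Q1": [1, -1, -1, 0, -1]}, 77: {"Q1": [1, 1, 0, 1, 1]}}

mismatched = count_human_mismatches(qwen, gpt, "Q1")
print(mismatched)  # [76]
```

Since the human responses should not depend on which model judged the task, any non-empty result here points at a pairing problem rather than a judge disagreement.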

Impact

Root Cause Investigation Needed

  1. Source of task count difference:

    • Which 3 tasks are missing from GPT dataset?
    • Were tasks filtered differently during collection?
  2. Source of human response mismatches:

    • Were responses recorded/processed differently?
    • Possible data corruption during file creation?
    • Task ID alignment issues?
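To answer the first two questions, something along these lines might help. It is a sketch under the assumption that each file can be reduced to an ordered list of task IDs; `diff_task_ids` is a hypothetical helper, not something in the repo.

```python
from typing import List, Optional, Tuple


def diff_task_ids(
    a_ids: List[int], b_ids: List[int]
) -> Tuple[List[int], List[int], Optional[int]]:
    """Compare two ordered task ID lists.

    Returns (IDs only in a, IDs only in b, first row index where the two
    lists stop agreeing, or None if they are identical).
    """
    only_a = sorted(set(a_ids) - set(b_ids))
    only_b = sorted(set(b_ids) - set(a_ids))
    first_divergence = next(
        (i for i, (x, y) in enumerate(zip(a_ids, b_ids)) if x != y), None
    )
    # If one list is a strict prefix of the other, they diverge where the
    # shorter one ends.
    if first_divergence is None and len(a_ids) != len(b_ids):
        first_divergence = min(len(a_ids), len(b_ids))
    return only_a, only_b, first_divergence


# Toy example: task 3 is missing from the second file.
qwen_ids = [1, 2, 3, 4, 5]
gpt_ids = [1, 2, 4, 5]
only_qwen, only_gpt, first_bad = diff_task_ids(qwen_ids, gpt_ids)
print(only_qwen, only_gpt, first_bad)  # [3] [] 2
```

If the files are compared row by row, every row after the first divergence is paired with the wrong task, which would be consistent with a small number of missing tasks producing a large number of response mismatches.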

Proposed Solutions

  1. Immediate:

    • Create aligned dataset using only matching task IDs
    • Document discarded/mismatched entries
    • Re-run analysis with cleaned data
  2. Process:

    • Add data integrity checks to collection pipeline
    • Implement checksums for human response data
    • Create standard task ID system across models
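The alignment and checksum ideas above could be combined roughly like this. This is only a sketch: the input layout (`{task_id: [human ratings]}`), the helper names, and the choice of SHA-256 over a JSON encoding are assumptions for illustration, not the project's pipeline.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def response_checksum(ratings: List[int]) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the ratings."""
    payload = json.dumps(ratings, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def align_tasks(
    a: Dict[int, List[int]], b: Dict[int, List[int]]
) -> Tuple[List[int], Dict[int, str]]:
    """Keep task IDs present in both files with identical human ratings.

    Returns (kept task IDs, {discarded task ID: reason}) so that every
    dropped entry is documented.
    """
    kept: List[int] = []
    discarded: Dict[int, str] = {}
    for task_id in sorted(set(a) | set(b)):
        if task_id not in a or task_id not in b:
            discarded[task_id] = "missing from one file"
        elif response_checksum(a[task_id]) != response_checksum(b[task_id]):
            discarded[task_id] = "human ratings differ"
        else:
            kept.append(task_id)
    return kept, discarded


# Toy data, not taken from the real files.
qwen_human = {1: [1, 0, -1], 2: [0, 0, 1], 3: [1, 1, 1]}
gpt_human = {1: [1, 0, -1], 2: [0, 1, 1]}

kept, discarded = align_tasks(qwen_human, gpt_human)
print(kept)       # [1]
print(discarded)  # {2: 'human ratings differ', 3: 'missing from one file'}
```

Storing the checksum next to each task at collection time would let later stages confirm that the human data was not reordered or altered when the per-model files were written.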

Next Steps

Labels: bug, data-quality, high-priority, needs-investigation

zrobertson466920 commented 1 day ago

Impact on Experimental Results

After implementing strict data alignment, our TVD-MI analysis shows significantly different patterns:

Pre-Alignment Results:

Post-Alignment Results:

This shows how data quality issues can mask genuine patterns in agreement metrics. The pattern predicted by the original hypothesis, that TVD-MI correlates with correctness assessment, only became visible after the data was properly aligned.

Key lesson: when measuring subtle effects in judge agreement, data consistency is crucial; even small misalignments can significantly change the results.
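A toy illustration of why misalignment matters for this kind of metric: the sketch below uses a simple plug-in estimate of TVD-MI, taken here to be the total variation distance between the empirical joint distribution and the product of its marginals. Whether this matches the estimator used in the project is an assumption, and the data is made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def tvd_mi(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in TVD-MI: 0.5 * sum over (x, y) of |P(x, y) - P(x) * P(y)|."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must be paired, so their lengths must match")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return 0.5 * sum(
        abs(joint.get((x, y), 0) / n - (px[x] / n) * (py[y] / n))
        for x in px
        for y in py
    )


judge = [0, 0, 1, 1]
human_aligned = [0, 0, 1, 1]   # each row paired with the right task
human_shifted = [0, 1, 1, 0]   # same labels, rows rotated by one position

aligned_score = tvd_mi(judge, human_aligned)
shifted_score = tvd_mi(judge, human_shifted)
print(aligned_score, shifted_score)  # 0.5 0.0
```

In this toy case a one-row shift is enough to take the estimate from its maximum for balanced binary labels down to zero, even though neither label set changed.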

hansorlee commented 1 day ago

This issue should be resolved in commit d6b06e1.