yfzhang114 / MME-RealWorld

✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

buggy evaluation script #1

Open chunyangx opened 1 week ago

chunyangx commented 1 week ago

The evaluation script significantly underestimates model performance.

The script simply extracts the first occurrence of A, B, C, D, or E and treats it as the answer, which produces many mistakes on longer outputs such as "The Cash Flow is read from the diagram, so the answer is (D)", where the "C" in "Cash" is matched before the intended "D".
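A minimal sketch of this failure mode, assuming the extraction reduces to a first-match regex over `[ABCDE]` (the helper name `first_letter` is hypothetical and is not taken from the repository):

```python
import re

def first_letter(s):
    # Hypothetical stand-in for the behavior described above:
    # return the first capital A-E found anywhere in the response.
    m = re.search(r"[ABCDE]", s)
    return m[0] if m else ""

response = "The Cash Flow is read from the diagram, so the answer is (D)"
print(first_letter(response))  # prints C, because "Cash" comes before "(D)"
```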

yfzhang114 commented 1 week ago

We follow the processing strategy outlined in Video-MME. In practice, if the model has strong instruction-following capabilities, it typically won't generate any analysis before the answer, so the impact on the results is minimal. However, if the model you're testing frequently generates analytical content, I recommend using the following evaluation code, which discards everything up to and including a known answer prefix before extracting the option letter.

import re

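def extract_characters_regex(s, choices):
    """Extract the selected option letter (A-E) from the model response `s`.

    `choices` is the list of option strings; the fallback below returns
    choice[1], so they are assumed to look like "(A) ...".
    Returns "" when no answer can be extracted.
    """
    s = s.strip()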
    answer_prefixes = [
        "The best answer is",
        "The correct answer is",
        "The answer is",
        "The answer",
        "The best option is",
        "The correct option is",
        "Best answer:",
        "Best option:",
        "Answer:",
        "Option:",
        "The correct answer",
        "The correct option",
    ]

    # Find the text after any of the answer prefixes
    for answer_prefix in answer_prefixes:
        prefix_pattern = re.escape(answer_prefix)
        match = re.search(prefix_pattern, s, re.IGNORECASE)
        if match:
            s = s[match.end():].strip()
            break  # Exit the loop once the relevant prefix is found

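    # After removing the prefix, continue with the existing logic:
    # a long response that contains no option letter counts as unanswered
    if len(s.split()) > 10 and not re.search(r"[ABCDE]", s):
        return ""

    matches = re.search(r"[ABCDE]", s)
    if matches is None:
        # No letter found: fall back to matching the response against the option text.
        # The `s and` guard keeps an empty response from matching every choice.
        for choice in choices:
            if s and s.lower() in choice.lower():
                return choice[1]  # assumes choices look like "(A) ..."
        return ""

    return matches[0]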
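Even after prefix stripping, the `[ABCDE]` search still matches capital letters inside ordinary words. One possible stricter variant is sketched below. It is not part of the repository's script: the function name `extract_last_option` is hypothetical, and it assumes the model states its final choice last, as a letter that is not part of a word.

```python
import re

def extract_last_option(s):
    """Return the last standalone option letter A-E in `s`, or "" if none."""
    # The lookbehind and lookahead reject letters that sit inside a word,
    # so the "C" in "Cash" is skipped while "(D)", "D." and "D" all match.
    letters = re.findall(r"(?<![A-Za-z])[A-E](?![A-Za-z])", s)
    return letters[-1] if letters else ""

response = "The Cash Flow is read from the diagram, so the answer is (D)"
print(extract_last_option(response))  # prints D
```

This trades one failure mode for another: a response such as "The answer is B. A dog is visible." would return "A", because the article "A" is a standalone capital that appears after the answer. It may be worth combining with the prefix stripping above rather than replacing it.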