Open chunyangx opened 1 week ago
We follow the processing strategy outlined in Video-MME. In practice, a model with strong instruction-following capabilities typically won't generate any analysis before its answer, so the impact on the results is minimal. However, if the model you're testing frequently generates analytical content, I recommend using the following evaluation code, which skips everything up to and including a known answer prefix:
```python
import re

def extract_characters_regex(s, choices):
    s = s.strip()
    answer_prefixes = [
        "The best answer is",
        "The correct answer is",
        "The answer is",
        "The answer",
        "The best option is",
        "The correct option is",
        "Best answer:",
        "Best option:",
        "Answer:",
        "Option:",
        "The correct answer",
        "The correct option",
    ]
    # Find the text after any of the answer prefixes
    for answer_prefix in answer_prefixes:
        prefix_pattern = re.escape(answer_prefix)
        match = re.search(prefix_pattern, s, re.IGNORECASE)
        if match:
            s = s[match.end():].strip()
            break  # Exit the loop once the relevant prefix is found
    # After removing the prefix, continue with the existing logic
    if len(s.split()) > 10 and not re.search("[ABCDE]", s):
        return ""
    matches = re.search(r'[ABCDE]', s)
    if matches is None:
        for choice in choices:
            if s.lower() in choice.lower():
                return choice[1]
        return ""
    return matches[0]
```
The evaluation script significantly underestimates model performance.
The script simply extracts the first A, B, C, D, or E in the output and treats it as the answer, which causes many mistakes on lengthy outputs such as "The Cash Flow is read from the diagram, so the answer is (D)", where the first match is the "C" in "Cash" rather than "D".
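The failure mode described above can be reproduced with a minimal sketch. The helper names `first_letter` and `letter_after_prefix` are hypothetical and are not part of the evaluation script; the second one only illustrates the general idea of searching after an answer prefix, assuming the phrase "the answer is" appears in the output.

```python
import re

def first_letter(s):
    """Naive extraction: return the first A-E character anywhere in the output."""
    m = re.search(r"[ABCDE]", s)
    return m[0] if m else ""

def letter_after_prefix(s, prefix="the answer is"):
    """Hypothetical helper: search only the text after the last occurrence of prefix."""
    idx = s.lower().rfind(prefix)
    # Fall back to the whole string if the prefix is absent.
    tail = s[idx + len(prefix):] if idx != -1 else s
    m = re.search(r"[ABCDE]", tail)
    return m[0] if m else ""

output = "The Cash Flow is read from the diagram, so the answer is (D)"
print(first_letter(output))         # C (picked up from "Cash")
print(letter_after_prefix(output))  # D
```

Searching after the prefix avoids capital letters that happen to appear in the explanation, though it still depends on the model using a recognizable answer phrase.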