In the Spider2 evaluation script, the tolerance for comparing numerical values is set to 1e-3. This strict tolerance causes discrepancies when evaluating results with higher precision. For instance, in my case, the output from sf_bq_025 provides more precise percentages (e.g., 58.744924 for Uganda) than the gold answer, which rounds these values to two decimal places (e.g., 58.74). Although the more precise results are mathematically accurate, they are judged incorrect because the difference exceeds the tolerance threshold.
Notably, the question does not require retaining 2 decimal places for the percentage values.
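To illustrate the issue, here is a minimal sketch of an absolute-tolerance comparison like the one described above; the function name and exact comparison logic are assumptions, not the actual Spider2 evaluation code.

```python
import math

# Hypothetical helper illustrating the tolerance check described above;
# the real Spider2 evaluation script may compare values differently.
def values_match(predicted: float, gold: float, tol: float) -> bool:
    # Absolute-difference comparison with the given tolerance.
    return math.isclose(predicted, gold, rel_tol=0.0, abs_tol=tol)

predicted = 58.744924  # full-precision result from sf_bq_025
gold = 58.74           # gold answer rounded to two decimal places

print(values_match(predicted, gold, tol=1e-3))  # False: |diff| ~= 0.0049 > 1e-3
print(values_match(predicted, gold, tol=1e-2))  # True:  |diff| ~= 0.0049 < 1e-2
```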
Thanks! We have updated the tolerance in the evaluation scripts from 1e-3 to 1e-2. The paper experiments used 1e-2, but we forgot to update it in the public repo.