In the Spider2 evaluation script, the tolerance for comparing numerical values is set to 1e-3. This strict tolerance causes discrepancies when evaluating results with higher precision. For instance, in my case, the output from sf_bq_025 provides more precise percentages (e.g., 58.744924 for Uganda) than the gold answer, which rounds these values to two decimal places (e.g., 58.74). Although the more precise results are mathematically accurate, they are judged incorrect because the difference exceeds the tolerance threshold.
Notably, the question does not require retaining 2 decimal places for the percentage values.
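To illustrate the issue, here is a minimal sketch of an absolute-tolerance comparison like the one described above; the function name and exact comparison logic are assumptions, not the actual Spider2 evaluation code.

```python
import math

# Hypothetical helper illustrating the tolerance check described above;
# the real Spider2 evaluation script may compare values differently.
def values_match(predicted: float, gold: float, tol: float) -> bool:
    # Absolute-difference comparison with the given tolerance.
    return math.isclose(predicted, gold, rel_tol=0.0, abs_tol=tol)

predicted = 58.744924  # full-precision result from sf_bq_025
gold = 58.74           # gold answer rounded to two decimal places

print(values_match(predicted, gold, tol=1e-3))  # False: |diff| ~= 0.0049 > 1e-3
print(values_match(predicted, gold, tol=1e-2))  # True:  |diff| ~= 0.0049 < 1e-2
```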
Thanks! We have updated the tolerance in the evaluation scripts from 1e-3 to 1e-2. The paper experiments used 1e-2, but we forgot to update it in the public repo.