This PR fixes two problems:
- The score is calculated against the wrong number of examples. Whether or not an example has an output, the total should be the number of test examples (260), not `len(output_results)`.
- CSV parsing errors can cause the evaluation script to fail. This PR catches them.

This PR also adds `examples` to `.gitignore`.
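
To illustrate the two fixes, here is a minimal sketch, not the actual evaluation script. The names `NUM_TEST_EXAMPLES`, `load_results`, and `score`, and the `correct` column, are hypothetical. Only the total of 260 and the `output_results` name come from the description above.

```python
import csv
import io

# Total number of test examples, as stated in the PR description.
NUM_TEST_EXAMPLES = 260


def load_results(csv_text):
    """Parse result rows from CSV text; return an empty list if parsing fails.

    The 'id,correct' layout used in the usage below is an assumption
    made for this sketch.
    """
    try:
        # strict=True makes the csv module raise csv.Error on malformed input.
        # list() forces the lazy reader to parse every row inside the try block.
        return list(csv.DictReader(io.StringIO(csv_text), strict=True))
    except csv.Error as exc:
        print(f"Could not parse results CSV: {exc}")
        return []


def score(output_results, total=NUM_TEST_EXAMPLES):
    """Score against the fixed test-set size, not len(output_results)."""
    correct = sum(1 for row in output_results if row.get("correct") == "1")
    return correct / total


if __name__ == "__main__":
    rows = load_results("id,correct\n1,1\n2,0\n3,1\n")
    print(score(rows))  # 2 correct out of 260, regardless of how many rows exist
```

With a fixed denominator, examples that have no output count as incorrect rather than being left out of the total.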