Our current evaluation assumes the model knows that some tasks are unachievable and will answer N/A in such cases. We should make this more flexible: if the model instead provides a text-based answer explaining why it stopped, we should match that answer against the annotated reason and judge accordingly.
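The relaxed judging rule above could be sketched as follows. This is a minimal illustration, not the actual evaluation code: the function name `judge_unachievable`, the overlap threshold, and the keyword-overlap matcher (a crude stand-in for a proper semantic or LLM-based judge) are all assumptions.

```python
def judge_unachievable(model_answer: str, annotated_reason: str) -> bool:
    """Judge whether the model correctly flagged an unachievable task.

    Accepts either an explicit "N/A" or a free-text explanation whose
    wording overlaps with the annotated reason. Keyword overlap is a
    hypothetical placeholder for a real semantic/LLM-based matcher.
    """
    answer = model_answer.strip().lower()
    # Case 1: the model explicitly answered N/A.
    if answer in ("n/a", "na"):
        return True
    # Case 2: free-text stop reason; compare against the annotation.
    answer_words = set(answer.split())
    reason_words = set(annotated_reason.lower().split())
    if not reason_words:
        return False
    overlap = len(answer_words & reason_words) / len(reason_words)
    return overlap >= 0.5  # threshold is an arbitrary assumption
```

With this rule, both an explicit `"N/A"` and an explanation like "I stopped because the item is out of stock" would be credited against the annotated reason "item is out of stock", while an unrelated answer would not.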