Kaggle does not support custom evaluation metrics or multiple challenge phases, both of which are common practice in popular challenges such as the COCO Caption Challenge and VQA (a sketch of such a host-provided evaluation script follows this paragraph).
CodaLab provides an open-source alternative to Kaggle and addresses several of its limitations, but it does not support evaluating interactive agents in dynamic environments.
AICrowd does not support human-in-the-loop evaluation of prediction-based or code-upload-based challenges.
ParlAI is not a challenge-hosting platform and only supports the evaluation of dialog models.
OpenAI Gym is not a dedicated evaluation platform and lacks support for prediction-based challenges, custom evaluation protocols, and human-in-the-loop evaluation.
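To make custom metrics and challenge phases concrete, here is a minimal Python sketch of the kind of host-provided evaluation script such a platform could invoke; the evaluate() signature, file formats, and phase names are illustrative assumptions, not the API of any specific platform.

    import json

    def evaluate(test_annotation_file, user_submission_file, phase_codename, **kwargs):
        # Hypothetical entry point: the platform passes the ground-truth file,
        # the participant's submission, and the name of the current phase, so
        # the challenge host controls both the metric and per-phase behaviour.
        with open(test_annotation_file) as f:
            ground_truth = json.load(f)   # e.g. {"image_1": "cat", ...}
        with open(user_submission_file) as f:
            predictions = json.load(f)    # same keys, predicted labels

        # Custom metric: plain accuracy over the ground-truth keys.
        correct = sum(predictions.get(k) == v for k, v in ground_truth.items())
        accuracy = correct / len(ground_truth)

        # Multiple phases: a dev phase and a test phase can score different
        # splits, or withhold some metrics until the challenge closes.
        if phase_codename == "dev":
            return {"result": [{"dev_split": {"Accuracy": accuracy}}]}
        return {"result": [{"test_split": {"Accuracy": accuracy}}]}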
Features that differentiate EvalAI:
Human-in-the-loop evaluation of machine learning models (via integration with Amazon Mechanical Turk, AMT).
The ability to run a user's code in a dynamic environment rather than against a static dataset, enabling the evaluation of interactive agents (see the sketch after this list).
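Below is a minimal, self-contained Python sketch of what evaluating an interactive agent in a dynamic environment looks like, as opposed to scoring a static prediction file; the environment, agent, and method names are illustrative assumptions rather than any specific platform's API.

    import random

    class GuessingEnvironment:
        # Toy dynamic environment: the agent must locate a hidden target by
        # acting repeatedly and reacting to the feedback it receives.
        def __init__(self, size=10, seed=0):
            self.size = size
            self.rng = random.Random(seed)

        def reset(self):
            self.target = self.rng.randrange(self.size)
            self.steps = 0
            return {"size": self.size, "hint": None}

        def step(self, action):
            self.steps += 1
            done = action == self.target or self.steps >= self.size
            reward = 1.0 if action == self.target else 0.0
            hint = "higher" if action < self.target else "lower"
            return {"size": self.size, "hint": hint}, reward, done

    class RandomAgent:
        # Stand-in for user-submitted code that the platform runs against the
        # environment instead of reading a fixed predictions file.
        def act(self, observation):
            return random.randrange(observation["size"])

    def evaluate_agent(agent, env, episodes=100):
        # Roll the agent through many episodes and report the mean reward.
        total = 0.0
        for _ in range(episodes):
            observation = env.reset()
            done = False
            while not done:
                observation, reward, done = env.step(agent.act(observation))
                total += reward
        return total / episodes

    print(evaluate_agent(RandomAgent(), GuessingEnvironment()))

The key difference from prediction-based evaluation is that each observation the agent receives depends on its earlier actions, so the score cannot be computed from a fixed submission file.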