Display: For code generation view, add "correct/incorrect" labels (and potentially execution outputs)

The code generation view currently displays only the output code and the expected code.

However, most code generation datasets, such as HumanEval or Odex, are evaluated by executing the generated code and checking whether its output is correct. Based on this:

At the very least, it should be possible to view whether the generated code was judged as correct by showing a "correct/incorrect" label.
Even better would be the ability to view:
1. Expected code
2. Predicted code
3. Output of expected code
4. Output of predicted code
5. Correctness/incorrectness value
This probably requires that the values in output_column and data_column for code tasks no longer be plain str, but a richer data structure that holds the code, its execution output, and the correctness value.
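
To make the proposal concrete, here is a rough sketch of what such a structure could look like. Everything in it is hypothetical: `CodeGenRecord`, `run_and_capture`, and `evaluate` are illustrative names, not existing Zeno APIs, and comparing captured stdout is only one possible correctness criterion (benchmarks may instead use unit tests). The `exec` call is unsandboxed, so real use would need isolation.

```python
from __future__ import annotations

import contextlib
import io
from dataclasses import asdict, dataclass


@dataclass
class CodeGenRecord:
    """Hypothetical per-example record (not part of Zeno's actual API)."""

    expected_code: str
    predicted_code: str
    expected_output: str | None = None
    predicted_output: str | None = None
    correct: bool | None = None


def run_and_capture(code: str) -> tuple[str, bool]:
    """Run a snippet; return (captured stdout or error text, succeeded).

    Assumption: snippets are trusted. exec() here is NOT sandboxed.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}", False
    return buffer.getvalue(), True


def evaluate(expected_code: str, predicted_code: str) -> CodeGenRecord:
    """Fill in outputs and a correctness flag by comparing stdout."""
    expected_output, expected_ok = run_and_capture(expected_code)
    predicted_output, predicted_ok = run_and_capture(predicted_code)
    return CodeGenRecord(
        expected_code=expected_code,
        predicted_code=predicted_code,
        expected_output=expected_output,
        predicted_output=predicted_output,
        # Correct only if both ran without error and printed the same text.
        correct=expected_ok and predicted_ok
        and expected_output == predicted_output,
    )
```

A record built this way can be turned into a plain dict with `asdict(record)`, which might be one way to store it in a column and let the view render all five fields listed above.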

zeno-ml/zeno #813