yanolja / arena


integration of automatic translation evaluation into model evaluation tools #32

Open kangsuhyun-yanolja opened 6 months ago

kangsuhyun-yanolja commented 6 months ago

Currently, there is a need for an automated tool that simplifies the process of evaluating translations. This tool should be capable of assessing the accuracy and quality of translations produced by various models. A potential solution could involve integrating this functionality into an existing framework like lm_evaluation_harness or creating a standalone service. This service could accept inputs in formats such as CSV or JSONL, giving users a straightforward way to obtain evaluations. Results from this tool could be essential for models aiming to participate in Arena.
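As a rough illustration (not part of any existing code), such a service could normalize CSV and JSONL inputs into a common record format; the field names `source`, `reference`, and `hypothesis` below are assumptions, not an agreed schema:

```python
import csv
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load evaluation records from a CSV or JSONL file.

    Each record is expected to carry a source sentence, a reference
    (gold) translation, and the model's hypothesis translation.
    The field names are illustrative only.
    """
    p = Path(path)
    if p.suffix == ".jsonl":
        with p.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if p.suffix == ".csv":
        with p.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported input format: {p.suffix}")


# Example (hypothetical file and fields):
# records = load_records("submissions/model_a.jsonl")
# record -> {"source": "안녕하세요", "reference": "Hello", "hypothesis": "Hi"}
```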

kangsuhyun-yanolja commented 6 months ago

@hist0613 Hello. May I ask how we could run the automatic evaluation of translation models?

hist0613 commented 6 months ago

@kangsuhyun-yanolja

  1. My work evaluates the model's translations against given references (gold translations), which is why I call it reference-based.
  2. You can refer to the repo (yanolja-org/iab-eval-translation).
  3. Among the files, I recommend looking at run_evaluation.py. It works as follows (a rough sketch of the flow appears after this list):
     a. It expects two translation files: one containing the gold translations and the other containing the output of the translation system to be evaluated, as in ./translations/ai-hub-ko-en.
     b. You can evaluate a given translation file (./translations/ai-hub-ko-en/deepl.jsonl) with a given metric (such as BLEU), as shown in the shell script ./scripts/evaluate.sh.
     c. You can see the evaluation results at ./results/eval_results.json by default, or at ./results/eval_results.md if you also run the visualization script.
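A minimal sketch of that reference-based flow, assuming one JSON object per line; sacrebleu stands in for the metric implementation here, and the gold file name and the `text` field are assumptions (only deepl.jsonl is mentioned above):

```python
import json
from pathlib import Path

import sacrebleu  # pip install sacrebleu


def read_translations(path: str) -> list[str]:
    """Read one translation per JSONL line; the 'text' field name is an assumption."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


# Gold translations and the system output to be evaluated,
# mirroring the layout described above (./translations/ai-hub-ko-en).
references = read_translations("./translations/ai-hub-ko-en/gold.jsonl")
hypotheses = read_translations("./translations/ai-hub-ko-en/deepl.jsonl")

# Corpus-level BLEU between the system output and the gold references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Write results to a JSON file, similar to ./results/eval_results.json.
Path("./results").mkdir(exist_ok=True)
with open("./results/eval_results.json", "w", encoding="utf-8") as f:
    json.dump({"deepl": {"BLEU": bleu.score}}, f, indent=2)
```
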
kangsuhyun-yanolja commented 6 months ago

@hist0613 Thank you for the detailed explanation!

kimsooyeon-yanolja commented 5 months ago

@kangsuhyun-yanolja
It would be nice to expand some functions within the translation part. To enhance the usability of the translation UI, I suggest (1) adding an alert function and (2) adding a reset button.

  1. Adding an alert function (see the sketch below)
     1-1) If the source language and the target language are the same (i.e., the two drop-down values match):
          [Alert] 'Source language and target language are the same.'
     1-2) If the language of the sentence in the prompt differs from the selected source language code:
          [Alert] 'The language specified by the source code is different.'
  2. Add a reset button to allow multiple attempts: it looks like a separate button is needed to clear the prompt or try Run again.

cc. @seungduk-yanolja
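
A minimal sketch of the two suggested alert checks, assuming a Gradio-based UI and using langdetect as an illustrative detector (neither assumption is confirmed in this thread):

```python
import gradio as gr  # assumption: the translation UI is built with Gradio
from langdetect import detect  # illustrative language detector, not necessarily the one in use


def validate_translation_request(source_lang: str, target_lang: str, prompt: str) -> bool:
    """Return True if the request passes both suggested checks, otherwise show an alert."""
    # Check 1-1: source and target drop-downs must differ.
    if source_lang == target_lang:
        gr.Warning("Source language and target language are the same.")
        return False

    # Check 1-2: the detected language of the prompt should match the
    # selected source language code (e.g. 'ko', 'en').
    detected = detect(prompt)
    if detected != source_lang:
        gr.Warning("The language specified by the source code is different.")
        return False

    return True
```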

kangsuhyun-yanolja commented 5 months ago

Thank you for the comment! Regarding item 1-2, I think we need to handle it now. We're already using a language detector, so it would be better to trust it; then users won't have to select the two options. I'll create an issue about it.
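
A rough sketch of auto-filling the source language from the detector's output, with langdetect as a placeholder for whatever detector is actually in use:

```python
from langdetect import detect  # placeholder for the detector already in place


def infer_source_language(prompt: str, default: str = "en") -> str:
    """Guess the source language from the prompt so the user does not have to pick it."""
    try:
        return detect(prompt)  # returns an ISO 639-1 code such as 'ko' or 'en'
    except Exception:
        # Fall back to a default when detection fails (e.g. empty prompt).
        return default
```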