Evaluation specifics - Githubissues

piotrmigdalek commented 3 months ago

Hi!

I'm trying to evaluate a Mistral-7B-based model with custom locality and portability data. For each of the 50 edits I have 6 locality prompts and 2 portability prompts.

How should I arrange the dicts to feed them into the edit function in that case? Will the variable below, passed as portability_inputs, work as intended?

portability_inputs = {
    'english': {
        'prompt': df_port['question_en'].tolist(),
        'ground_truth': df_port['label_en'].tolist()
    },
    'polish': {
        'prompt': df_port['question_pl'].tolist(),
        'ground_truth': df_port['label_pl'].tolist()
    }
}

And a technical question: are the metrics calculated after each edit? If so, is there an option to evaluate everything on the final model after all 50 sequential edits?

Thank you :)

pengzju commented 3 months ago

Q1:

Your usage is correct; just make sure that the prompt and ground_truth lists under each dimension, such as "english" and "polish", contain the same number of items.
You can also check if the number of metrics recorded in the logs matches the number of input prompts.
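
A minimal sketch of that consistency check, in case it helps. The helper name check_inputs and the toy data are hypothetical and not part of EasyEdit; the optional n_edits comparison assumes each list is meant to hold one entry per edit, which is worth confirming against the library before relying on it.

```python
def check_inputs(inputs, n_edits=None):
    """Return a list of length mismatches found in a locality/portability dict.

    `inputs` maps a dimension name (e.g. "english") to a dict holding
    'prompt' and 'ground_truth' lists. Hypothetical helper, not an EasyEdit API.
    """
    problems = []
    for name, dim in inputs.items():
        n_prompt = len(dim["prompt"])
        n_truth = len(dim["ground_truth"])
        # Each prompt needs exactly one ground truth.
        if n_prompt != n_truth:
            problems.append(f"{name}: {n_prompt} prompts vs {n_truth} ground truths")
        # Assumption: one prompt per edit in every dimension.
        if n_edits is not None and n_prompt != n_edits:
            problems.append(f"{name}: {n_prompt} prompts vs {n_edits} edits")
    return problems


# Toy data standing in for the DataFrame columns from the question.
portability_inputs = {
    "english": {"prompt": ["q1", "q2"], "ground_truth": ["a1", "a2"]},
    "polish": {"prompt": ["p1", "p2"], "ground_truth": ["o1"]},
}

problems = check_inputs(portability_inputs, n_edits=2)
print(problems)
```

An empty list means every dimension is consistent; here the "polish" dimension is reported because one ground truth is missing.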

Q2:

I haven't implemented this feature (a unified evaluation after all edits have been applied) yet, but you can refer to the pseudocode in #220. I will improve this in the next version. Thank you!
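
To illustrate the general idea of evaluating only after all sequential edits, here is a rough, library-agnostic sketch. It is not the pseudocode from #220 and does not use any EasyEdit API: edit_then_evaluate, toy_apply_edit and toy_evaluate are hypothetical names, and the "model" is a plain dict standing in for a real one.

```python
def edit_then_evaluate(model, requests, apply_edit, evaluate):
    """Apply every edit first, then score all requests on the final model.

    `apply_edit(model, request)` returns the edited model and
    `evaluate(model, request)` returns a metrics dict; both are
    caller-supplied placeholders (assumptions, not library functions).
    """
    # Phase 1: sequential editing, carrying the edited model forward.
    for request in requests:
        model = apply_edit(model, request)
    # Phase 2: evaluate every request against the same final model.
    return [evaluate(model, request) for request in requests]


# Toy stand-ins: the "model" is a dict mapping prompt -> answer.
def toy_apply_edit(model, request):
    updated = dict(model)
    updated[request["prompt"]] = request["target_new"]
    return updated


def toy_evaluate(model, request):
    hit = model.get(request["prompt"]) == request["target_new"]
    return {"prompt": request["prompt"], "rewrite_acc": float(hit)}


requests = [
    {"prompt": "capital of X", "target_new": "A"},
    {"prompt": "capital of Y", "target_new": "B"},
    {"prompt": "capital of X", "target_new": "C"},  # overwrites the first edit
]

results = edit_then_evaluate({}, requests, toy_apply_edit, toy_evaluate)
print([r["rewrite_acc"] for r in results])
```

In this toy run the first request scores 0.0 on the final model because the third edit overwrote it, which is the kind of interference that per-edit metrics would not show.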

zxlzr commented 3 months ago

Hi, do you have any further questions?

piotrmigdalek commented 3 months ago

Nothing as of now, thanks :)

zjunlp / EasyEdit

Evaluation specifics #251