[Evaluation Issues]: T5 Results are really strange

I am doing a random 500 sample of edits from Counterfact and I am getting the following results. FT for llama2-7b-chat seems off but mostly T5 results seem very bad. Can you check that the T5 evaluation code is working?

[
    {
        "model": "llama2-7b-chat",
        "method": "ft",
        "edit_success": 0.306047197640118,
        "rephrase_acc": 0.41076696165191745,
    },
    {
        "model": "llama2-7b-chat",
        "method": "serac",
        "edit_success": 0.995575221238938,
        "rephrase_acc": 0.6342182890855457,
    },
    {
        "model": "t5-small",
        "method": "ft",
        "edit_success": 0.052254428610133664,
        "rephrase_acc": 0.052254428754106234,
    },
    {
        "model": "t5-small",
        "method": "serac",
        "edit_success": 0.01774461055869487,
        "rephrase_acc": 0.010779436399687582,
    },
    {
        "model": "gpt2-xl",
        "method": "ft",
        "edit_success": 0.9652509652509652,
        "rephrase_acc": 0.4362934362934363,
    },
    {
        "model": "gpt2-xl",
        "method": "serac",
        "edit_success": 0.9388489208633094,
        "rephrase_acc": 0.38489208633093525,
    },
    {
        "model": "gpt2-xl",
        "method": "memit",
        "edit_success": 0.8115942028985508,
        "rephrase_acc": 0.5181159420289855,
    },
]

zjunlp / EasyEdit

[Evaluation Issues]: T5 Results are really strange #322