You can see code here: https://github.com/zjunlp/EasyEdit/blob/main/easyeditor/evaluate/evaluate_utils.py#L277
```python
def verify_answer(model_answer, correct_answer):
    # Wrap a bare string target into the nested-list format expected below.
    if type(correct_answer) is str:
        correct_answer = [[correct_answer]]
    # Each answer group must have at least one variant contained in the output.
    for answer in correct_answer:
        if True not in [possible_answer in model_answer for possible_answer in answer]:
            return False
    return True


def answer_match(
    model,
    tok,
    prompt: str,
    target_new: str,
    device,
):
    # Greedily decode up to 30 new tokens and check whether the target
    # appears anywhere in the decoded output.
    inputs = tok.encode(prompt, return_tensors='pt').to(device)
    outputs = model.generate(inputs, temperature=0, max_new_tokens=30)
    predict = tok.decode(outputs[0], skip_special_tokens=True)
    return verify_answer(predict, target_new)
```
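For context, here is a minimal usage sketch; the checkpoint name and prompt are illustrative assumptions, not taken from EasyEdit:

```python
# Hypothetical usage of answer_match; 'gpt2' is only an illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tok = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').to(device)

# True if the decoded generation contains the target string.
hit = answer_match(model, tok, prompt='The capital of France is',
                   target_new='Paris', device=device)
print(hit)
```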
Alternatively, you can evaluate with GPT-4 or a string exact match. Pseudo code:

```python
outputs = model.generate(inputs, temperature=0, max_new_tokens=30)
predict = tok.decode(outputs[0], skip_special_tokens=True)
metric = em(predict, target_new)  # or: metric = gpt4_eval(predict, target_new)
```
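A minimal sketch of what the string exact match could look like; `em` is not defined in the thread, so the normalization here is my assumption, and `predict` should be the generated continuation with the prompt stripped:

```python
import string

def em(predict: str, target_new: str) -> int:
    # Hypothetical exact-match helper (not defined in the thread):
    # 1 if the normalized prediction equals the normalized target, else 0.
    def normalize(s: str) -> str:
        s = s.lower().strip()
        return s.translate(str.maketrans('', '', string.punctuation))
    return int(normalize(predict) == normalize(target_new))
```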
Thanks for your reply~ If I set `vanilla_generation` to `True` in the `test_prediction_acc` function, will it enable all models to perform autoregressive generation?

```python
def test_prediction_acc(model, tok, hparams, prompts, targets, device, locality=False, vanilla_generation=False):
```
When `vanilla_generation` is set to `True`, the generated token sequence length matches that of `target_new`. This means that every token must match exactly for the metric to be 1, which makes the evaluation very strict.
A more reasonable approach would be to let the LLM generate a passage and then check whether `target_new` appears within it, or to calculate recall (as in the code above; see the sketch below). In this case the accuracy is either 0 or 1, not a token-by-token accuracy. There is no other way.
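A minimal sketch of the token-level recall idea, under my own interpretation of "calculate recall"; the function name and tokenization details are assumptions:

```python
def target_recall(predict: str, target_new: str, tok) -> float:
    # Hypothetical recall metric: the fraction of target tokens that also
    # appear in the generated passage. One possible reading of
    # "calculate recall", not EasyEdit's official implementation.
    target_ids = tok.encode(target_new, add_special_tokens=False)
    predict_ids = set(tok.encode(predict, add_special_tokens=False))
    if not target_ids:
        return 0.0
    hits = sum(1 for t in target_ids if t in predict_ids)
    return hits / len(target_ids)
```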
If locality is calculated in the same way, can it reach 100%?
It's OK. Locality is defined as the requirement that the post-edit model should not change its output on irrelevant examples:
$\text{Loc.} = \frac{1}{T}\sum\limits_{t=1}^{T} \mathbb{1}\left(f_{\Theta_{T}}(x_{\text{loc}}^{t}) = f_{\Theta_{0}}(x_{\text{loc}}^{t})\right)$
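In code, the definition reads roughly as follows; this is a sketch assuming the pre- and post-edit outputs on the $T$ locality prompts have already been collected (the names are illustrative):

```python
def locality_score(pre_outputs, post_outputs) -> float:
    # Loc. = (1/T) * sum_t 1( f_{Theta_T}(x_loc^t) == f_{Theta_0}(x_loc^t) )
    # pre_outputs / post_outputs: outputs of the pre- and post-edit models
    # on the same T locality prompts (illustrative names, not EasyEdit API).
    assert len(pre_outputs) == len(post_outputs)
    agree = sum(pre == post for pre, post in zip(pre_outputs, post_outputs))
    return agree / len(pre_outputs)
```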
Thanks for your response! In the autoregressive setting, can the calculation be expressed as the recall of the post-edit model on the `loc_prompt` divided by the recall of the pre-edit model?
Most prior literature compares outputs before and after editing, and I don't understand what you mean by "recall". Moreover, if you wish to change the metrics, there's no need to consult the EasyEdit team; you can simply follow your own discretion.
At least in my understanding, the concept of recall should not be equivalent to $\text{Loc.} = \frac{1}{T}\sum\limits_{t=1}^{T} \mathbb{1}(f_{\Theta_{T}}(x_{\text{loc}}^{t}) = f_{\Theta_{0}}(x_{\text{loc}}^{t}))$
It's Accuracy (Acc.)
You mean token-by-token accuracy for locality in autoregressive generation?
Yes
hi buddy, do you have any further questions?
How to implement autoregressive generation instead of teacher-forcing in the inference phase?
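For reference, the distinction in code; this is a sketch of my understanding, with the teacher-forcing branch mirroring what `test_prediction_acc` does when `vanilla_generation` is off, not EasyEdit's exact implementation:

```python
import torch

def teacher_forcing_acc(model, tok, prompt, target_new, device):
    # Teacher-forcing: feed prompt + gold target in one forward pass and
    # check each target token's prediction against the gold token.
    prompt_ids = tok.encode(prompt, return_tensors='pt').to(device)
    target_ids = tok.encode(target_new, return_tensors='pt',
                            add_special_tokens=False).to(device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so target predictions come from the
    # positions just before each target token.
    pred = logits[:, prompt_ids.size(1) - 1:-1, :].argmax(dim=-1)
    return (pred == target_ids).float().mean().item()

def autoregressive_predict(model, tok, prompt, device, max_new_tokens=30):
    # Autoregressive inference: the model consumes its own previous
    # outputs via generate(), with no gold tokens fed back in.
    inputs = tok.encode(prompt, return_tensors='pt').to(device)
    outputs = model.generate(inputs, do_sample=False,
                             max_new_tokens=max_new_tokens)
    return tok.decode(outputs[0, inputs.size(1):], skip_special_tokens=True)
```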