Finding a Bug - Githubissues

SonglinZhai commented 1 month ago

The implementations of 'slice_list' and 'test_prediction_acc' probably be still not correct! The variable prompt_len is incorrectly used in slice_list, cauing the labels and answers being not correctly sliced. In addition, the current implementation of prompt_len is unable to perform the correct slice.

Solution: On the one hand, in test_prediction_acc: Line-123

num_pad_toks = [int((i == tok.pad_token_id).sum()) for i in prompt_target_tok['input_ids'].cpu()] modified into: num_answer_toks = [len(i) for i in tok(targets).input_ids] Line-124 prompt_len = [x+y for x,y in zip(num_pad_toks,num_prompt_toks)] modified into: prompt_len = [(x,y) for x,y in zip(num_prompt_toks, num_answer_toks)]

Then, for slice_list: Line-286

return [row[start_index-1:-1] for row, start_index in zip(matrix, start_indices)] modified into: return [row[start_index[0]-1:start_index[0]+start_index[1]-1] for row, start_index in zip(matrix, start_indices)] Line-288 return [row[start_index:] for row, start_index in zip(matrix, start_indices)] modified into: return [row[start_index[0]:start_index[0]+start_index[1]] for row, start_index in zip(matrix, start_indices)] Line-291 return matrix[start_indices[0]-1:-1] modified into: return matrix[start_index[0][0]-1:-1] Line-293 return matrix[start_indices[0]:] modified into: return matrix[start_index[0][0]:]

zxlzr commented 1 month ago

Hi, this file has been updated recently, we will check the code assp.

XeeKee commented 1 month ago

Hello, I will carefully check it tomorrow. If it's convenient for you, could you please explain the significance of your modification? I would be very grateful.

SonglinZhai commented 1 month ago

Recently, I am also doing knowledge editing. I heard about EasyEdit from my friends and want to borrow the evaluation code of this tool in my approach. When I used the evaluation code last night, I found that the results produced by slice_list were not always expected results. i.e., when there is padding, the results are incorrect.

For example:

prompt_target: On which continent is Abrit Nunatak located? South America <|endoftext|> Input_ids: [ 818, 644, 1175, 750, 3271, 1375, 335, 5557, 1907, 287, 30, 2159, 1810, 314, 50256] in that batch data, the padding_size = 1 for the current case

here:

num_prompt_toks = 11 ([818, 644, 1175, 750, 3271, 1375, 335, 5557, 1907, 287, 30]) num_pad_toks = 1 ([50256]) prompt_len = 12 slice result for label: return [row[start_index:] for row, start_index in zip(matrix, start_indices)] [1810, 314, 50256] due to: [0:818, 1:644, 2:1175, 3:750, 4:3271, 5:1375, 6:335, 7:5557, 8:1907, 9:287, 10:30, 11:2159, 12:1810, 13:314, 14:50256] expected results: [2159, 1810, 314]

The motivation behind the above modification (Mentioned last night):

the prompt_target input_ids could be treated as [promt_len, answer_len, pad_len] so, we can slice the labels by row[start_index[0]-1:start_index[0]+start_index[1]-1] where start_index[0] and start_index[1] store the prompt_len and answer_len these two could be calculated by: prompt_len = [(x,y) for x,y in zip(num_prompt_toks, num_answer_toks)]

It is the same for the slice of answer. I hope I do not understand your code incorrectly.

pengzju commented 1 month ago

I attempt to reproduce your issue, and it is indeed a complex case.

The model you are editing should be from the GPT-2 series. My suggestion is to remove<|endoftext|>. This way, no pad_tokens will interfere with the evaluation. In the original tokenizer, there is no pad_token_id, so most implementations assign the eos_token_id to pad_token_id. This results in an additional num_pad_toks.
Since modifying the evaluation has a huge impact, we provide minimal changes to this module. You can try my suggestion to remove the <|endoftext|> (like prompt_target: On which continent is Abrit Nunatak located? South America), which will not affect your editing experiments.

Please give it a try, and I hope everything goes smoothly. Thank you for your valuable feedback.

SonglinZhai commented 1 month ago

Hi pengzju: Thanks for your suggestions. Actually, I have revised this evaluation code last night for my approach and it works well.

However, if the editing is performed on a batch of examples, the pad_token_id must be specified (as mentioned by you). So, the cases with shorter length will be automatically padded with eos_token (if assigning the eos_token_id to pad_token_id), causing an incorrect slice of labels for these cases.

The example I gave above is in this case; and I merely listed a sentence in the batch data for a clearer explanation.

Take a more extreme example (feeding a batch of data):

[prompt_target_1, prompt_target_2, prompt_target_3 ...] If the prompt_target_1 is: On which continent is Abrit Nunatak located? South America <|endoftext|> <|endoftext|> <|endoftext|> these eos_tokens are added automatically since editing is performed in batch data Input_ids: [ 818, 644, 1175, 750, 3271, 1375, 335, 5557, 1907, 287, 30, 2159, 1810, 314, 50256, 50256, 50256] the slice result of label will be [50256, 50256, 50256] due to prompt_len=11+3=14 [0:818, 1:644, 2:1175, 3:750, 4:3271, 5:1375, 6:335, 7:5557, 8:1907, 9:287, 10:30, 11:2159, 12:1810, 13:314, 14:50256, 15:50256, 16:50256] original code: return [row[start_index:] for row, start_index in zip(matrix, start_indices)]

what do you think?

pengzju commented 1 month ago

I understand your point. Regardless, in the scenario of evaluating batch edits, we cannot avoid adding pad_token. This will lead to the issue you mentioned. However, in the editor, we split the batch_size=n evaluation into n evaluations of batch_size=1. This means that the current EasyEdit Evaluation Module should be bug-free.
I also find your suggestion very meaningful, but based on the discussion in #302, shouldn't we avoid using tuples for input (e.g., prompt_len = [(x,y) for x,y in zip(num_prompt_toks, num_answer_toks)])? Instead, we should use text_a + ' ' + text_b. I'm not sure if the code you provided aligns with this concatenation format. Could you please provide a new code example (evaluating text_a + ' ' + text_b)? I will modify the corresponding code based on your suggestion.

Thanks again.

pengzju commented 1 month ago

Additionally, I believe your code will only work when padding_side='right'. Consider if the padding_side is set to left, and batch editing is performed, the tokenization of target_new will result in many pad_token on the left side. These should not be considered as answers or labels but should be ignored.

However, your code includes them as part of the ACC calculation

SonglinZhai commented 1 month ago

I totally agree with you: "If the evaluation into n evaluations of batch_size=1, the evaluation should be bug-free.". The solution I mentioned above only works well with the setting of padding_side='right'.

pengzju commented 1 month ago

Therefore, batch evaluation is a very complex problem, because different editing methods have different corresponding padding_side, so we can only unify them into a single evaluation(bs = 1).

SonglinZhai commented 1 month ago

It is a good choice for the complex situations.

pengzju commented 1 month ago

Thank you very much for the discussion. Actually, the solution you provided, num_answer_toks = [len(i) for i in tok(targets).input_ids], is also bug-free when bs=1. However, I prefer to keep the original code and hope you understand why I separated all batch evaluations. I wish you success with your experiments.

SonglinZhai commented 1 month ago

Thanks.

SonglinZhai commented 1 month ago

One last question: Whether it is unreasonable to set paddng_side = left? because the prompt should be next to the target in the generation scenario mentioned by you.

(left) prompt ... pading ... target VS (right) prompt target ... padding...

If in the editing phase, does this (paddng_side = left) affect model editing performance?

SonglinZhai commented 1 month ago

My friends tested ROME and MEMIT on ZsRE dataset (full data: 19086 cases) based on EasyEdit, and the performance is very poor.

(GPT2-1.5B version)

ROME: 'post': {'pre': {'rewrite_acc': 0.20420484207422276, 'rephrase_acc': 0.198043279039821}, 'post': {'rewrite_acc': 0.056103763416715316, 'rephrase_acc': 0.047820100830160556, 'locality': {'neighborhood_acc': 0.005416384270414259}}}

MEMIT: {'pre': {'rewrite_acc': 0.20420484207422276, 'rephrase_acc': 0.198043279039821}, 'post': {'rewrite_acc': 0.14768535151669387, 'rephrase_acc': 0.12228780976030271, 'locality': {'neighborhood_acc': 0.07205569589870872}}}

If there are no bugs, what hyperparameters in EasyEdit do you think might affect performance? Thanks in advance.

pengzju commented 1 month ago

One last question: Whether it is unreasonable to set paddng_side = left? because the prompt should be next to the target in the generation scenario mentioned by you.

(left) prompt ... pading ... target VS (right) prompt target ... padding...

If in the editing phase, does this (paddng_side = left) affect model editing performance?

据我所知，很多自回归模型的预训练阶段padding_side都是left (https://zhuanlan.zhihu.com/p/646852375）

我觉得理论上是不会影响模型性能的。但从事实的角度来讲，model editing会有副作用从而影响通用能力，相关的文献很多，你可以看一看 (https://arxiv.org/abs/2401.07453)

pengzju commented 1 month ago

If there are no bugs, what hyperparameters in EasyEdit do you think might affect performance?

你可以尝试将sequential_edit设置为False，多次编辑ROME和MEMIT确实很差，很多文献都提到过。

SonglinZhai commented 1 month ago

Thanks very much. We will check this hyperparameter.

pengzju commented 1 month ago

If there are no bugs, what hyperparameters in EasyEdit do you think might affect performance?

你可以尝试将sequential_edit设置为False，多次编辑ROME和MEMIT确实很差，很多文献都提到过。

https://arxiv.org/abs/2211.11031 https://arxiv.org/abs/2405.14768 https://arxiv.org/abs/2402.10987 https://arxiv.org/abs/2403.05330

pengzju commented 1 month ago

Thanks very much. We will check this hyperparameter.

YuxinZhangGit commented 1 month ago

If there are no bugs, what hyperparameters in EasyEdit do you think might affect performance?

你可以尝试将sequential_edit设置为False，多次编辑ROME和MEMIT确实很差，很多文献都提到过。

We checked the sequential_edit setting, which is set to False by default in EasyEdit. Consequently, our test results did not show any improvement. Are there other critical parameters that might influence the final editing performance?

Thank you for your assistance!

XeeKee commented 1 month ago

According to EasyEdit's default parameters, the editing effect should not be too poor. Could you provide more detailed information so that we can better assist you?

zxlzr commented 1 month ago

hi, could you please provide more details, have you solved your issue yet?

YuxinZhangGit commented 1 month ago

According to EasyEdit's default parameters, the editing effect should not be too poor. Could you provide more detailed information so that we can better assist you?

editor_file yaml_file Using the configuration of editor=ROME and base model=gpt2-xl as an example, these two images show the detailed information of the parameter settings.

pengzju commented 3 weeks ago

Delete the code: https://github.com/zjunlp/EasyEdit/blob/main/easyeditor/models/rome/rome_main.py#L56 https://github.com/zjunlp/EasyEdit/blob/main/easyeditor/models/rome/rome_main.py#L57

and try it again

pengzju commented 3 weeks ago

keep_original_weight will be deprecated, you can ignore this params. I will fix this issue asap

pengzju commented 3 weeks ago

Do you have any further questions? @YuxinZhangGit

zjunlp / EasyEdit

Finding a Bug #303