zjunlp / EasyEdit

[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
https://zjunlp.github.io/project/KnowEdit
MIT License

Difficulties to reproduce the KnowEdit results in the survey paper #390

Open StarLooo opened 1 day ago

StarLooo commented 1 day ago

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. Concern about the generation config: I noticed that in the EasyEdit code, the generation-related code does not explicitly specify the generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation strategy not being greedy decoding, potentially affecting the reproducibility of the results.
StarLooo commented 1 day ago
  1. Perhaps a bug in the summary_metrics function: according to the discussion in issue #279, the summary method for all_metrics can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):
     * Edit Succ.: compute the average rewrite_acc over all edit samples.
     * Portability: first compute the average Subject_Aliasing_acc, reasoning_acc, and Logical_Generalization_acc for each edit sample; then average the three sub-scores to get the Portability of that edit sample; finally compute the average Portability over all edit samples.
     * Locality: first compute the average Relation_Specificity_acc and Forgetfulness_acc for each edit sample; then average the two sub-scores to get the Locality of that edit sample; finally compute the average Locality over all edit samples.
     * Fluency: compute the average ngram_entropy over all edit samples.

     If the above summarization method is correct, the implementation of the summary_metrics function has a minor bug that causes ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The culprit is the line mean_metrics[eval][key][lkey] = np.mean(metrics): when summarizing Portability, it computes np.mean() over a list of sub-lists of variable length (the number of Portability evaluation samples differs between edits), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code first computes the average sub-score of each edit and then uses mean_metrics[eval][key][lkey] = np.mean(metrics) to summarize. Besides, I noticed that the summary_metrics function does not summarize the Fluency metric, and the magnitude of ngram_entropy is far lower than the reported Fluency metric (5.x~6.x vs ~500).
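For concreteness, a minimal self-contained sketch with toy data (the nested dict layout only loosely mimics all_metrics and is not copied from EasyEdit) of why averaging within each edit first avoids the inhomogeneous-shape error:

```python
import numpy as np

# Toy per-edit metrics: the number of portability evaluation samples
# differs between edits, so the raw accuracy lists have unequal lengths.
all_metrics = [
    {"post": {"portability": {"reasoning_acc": [1.0, 0.0, 1.0]}}},
    {"post": {"portability": {"reasoning_acc": [1.0]}}},
    {"post": {"portability": {"reasoning_acc": [0.0, 1.0]}}},
]
eval, key, lkey = "post", "portability", "reasoning_acc"  # names mirror the snippet above

# Original version: a list of unequal-length lists, so np.mean raises
# "ValueError: setting an array element with a sequence ...".
# metrics = [m[eval][key][lkey] for m in all_metrics if lkey in m[eval][key]]

# Proposed fix: average within each edit first, then across edits.
metrics = [np.mean(m[eval][key][lkey]) for m in all_metrics if lkey in m[eval][key]]
print(np.mean(metrics))  # (0.667 + 1.0 + 0.5) / 3 ≈ 0.72
```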
StarLooo commented 1 day ago
  1. The meaning of pre-edit metrics: I'm uncertain about the meaning of the pre-edit metrics. My speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit makes the knowledge relatively more or less portable; and the pre-edit Locality is meaningless. However, I don't understand the meaning of the pre-edit rewrite_acc, how it is computed, and why it can be greater than 0 before we conduct any editing.
StarLooo commented 1 day ago
  1. Cannot reproduce the KnowEdit results when using AdaLoRA or ROME: I tried to reproduce the KnowEdit results in your survey paper following your tutorial with the same hyper-parameters and command. I used Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it's very strange that I always get post_rewrite_acc=1.0 on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I met the same error as #290 when editing Wikidata_recent, and escaped this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with post_rewrite_acc=0.9857. I'm not sure whether this is due to the fp16 setting mentioned in #290, and I'll try again with fp16=false.

If you need more details on the experiments, such as the results.json or log files, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!

XeeKee commented 1 day ago

Thank you very much for your attention to EasyEdit. We will address your issue shortly.

zxlzr commented 1 day ago
  1. Cannot reproduce the KnowEdit results when using AdaLoRA or ROME: I tried to reproduce the KnowEdit results in your survey paper following your tutorial with the same hyper-parameters and command. I used Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it's very strange that I always get post_rewrite_acc=1.0 on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I met the same error as "RuntimeError: probability tensor contains either inf, nan or element < 0" #290 when editing Wikidata_recent, and escaped this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with post_rewrite_acc=0.9857. I'm not sure whether this is due to the fp16 setting mentioned in "RuntimeError: probability tensor contains either inf, nan or element < 0" #290, and I'll try again with fp16=false.

If you need more details on the experiments, such as the results.json or log files, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!

Apologies for the issue regarding the results. EasyEdit has been under maintenance to improve its editing performance, and we have optimized the EasyEdit code, which has led to better results (possibly an improvement for AdaLoRA). Due to limited computational resources recently, we will update the paper as soon as possible (aiming for 1-2 weeks, and we will notify you). As for the other issues, we will address them as soon as possible.

XeeKee commented 1 day ago

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. Concern about the generation config: I noticed that in the EasyEdit code, the generation-related code does not explicitly specify the generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation strategy not being greedy decoding, potentially affecting the reproducibility of the results.

You can check the code here: https://github.com/zjunlp/EasyEdit/blob/b6462bea7c1131591935ebe060fff5a2dd2683e5/easyeditor/evaluate/evaluate_utils.py#L80. We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.
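For anyone who wants to pin this down explicitly in the meantime, a minimal sketch of forcing greedy decoding at the call site (the model name and max_new_tokens value are placeholders, not EasyEdit's actual configuration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
# do_sample=False forces greedy decoding even if the checkpoint's
# generation_config.json ships with top_p / temperature defaults.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```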

XeeKee commented 1 day ago
  1. Perhaps a bug in the summary_metrics function: according to the discussion in the issue "the computing methods of Edit succ, Portability, Locality, Fluency" #279, the summary method for all_metrics can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):
     * Edit Succ.: compute the average rewrite_acc over all edit samples.
     * Portability: first compute the average Subject_Aliasing_acc, reasoning_acc, and Logical_Generalization_acc for each edit sample; then average the three sub-scores to get the Portability of that edit sample; finally compute the average Portability over all edit samples.
     * Locality: first compute the average Relation_Specificity_acc and Forgetfulness_acc for each edit sample; then average the two sub-scores to get the Locality of that edit sample; finally compute the average Locality over all edit samples.
     * Fluency: compute the average ngram_entropy over all edit samples.

     If the above summarization method is correct, the implementation of the summary_metrics function has a minor bug that causes ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The culprit is the line mean_metrics[eval][key][lkey] = np.mean(metrics): when summarizing Portability, it computes np.mean() over a list of sub-lists of variable length (the number of Portability evaluation samples differs between edits), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code first computes the average sub-score of each edit and then uses mean_metrics[eval][key][lkey] = np.mean(metrics) to summarize. Besides, I noticed that the summary_metrics function does not summarize the Fluency metric, and the magnitude of ngram_entropy is far lower than the reported Fluency metric (5.x~6.x vs ~500).

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.
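A quick sanity check of that scaling (toy numbers, not taken from the repo):

```python
import numpy as np

ngram_entropies = [5.48, 5.72, 5.63]   # per-sample ngram_entropy values (~5.x-6.x)
print(100 * np.mean(ngram_entropies))  # ≈ 561, matching the ~500 scale of the reported Fluency
```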

XeeKee commented 1 day ago
  1. The meaning of pre-edit metrics: I'm uncertain about the meaning of the pre-edit metrics. My speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit makes the knowledge relatively more or less portable; and the pre-edit Locality is meaningless. However, I don't understand the meaning of the pre-edit rewrite_acc, how it is computed, and why it can be greater than 0 before we conduct any editing.

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency
    This metric evaluates whether the smoothness of the output has changed after the model performs an edit.

  2. Portability
    This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.

  3. Locality
    This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.

  4. Rewrite Accuracy (rewrite_acc)
    This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.
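A toy illustration of that point (whitespace "tokens" are used as a stand-in here; the real metric works on the model tokenizer's tokens and compares the model's predictions against the target, so this is only a schematic):

```python
# Compare position by position over the new target's tokens.
target_new = "Trump Biden".split()
pre_edit_answer = "Joe Biden".split()

matches = sum(a == b for a, b in zip(target_new, pre_edit_answer))
print(matches / len(target_new))  # 0.5 -- "Biden" already matches before any edit
```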

XeeKee commented 1 day ago

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise. My WeChat ID: xi1786594371

littlefive5 commented 18 hours ago

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

| DataSet | Metric | AdaLoRA | ROME | MEMIT | FT-L |
| --- | --- | --- | --- | --- | --- |
| WikiData_recent | Edit Succ. $\uparrow$ | | | | |
| | Portability $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |
| ZsRE | Edit Succ. $\uparrow$ | 100.00 | 97.22 | 95.33 | 53.93 |
| | Portability $\uparrow$ | 58.03 | 51.99 | 52.70 | 45.64 |
| | Locality $\uparrow$ | 35.16 | 65.66 | 63.49 | 38.07 |
| | Fluency $\uparrow$ | 548.08 | 571.60 | 563.31 | 492.77 |
| WikiBio | Edit Succ. $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |
| WikiData_counterfact | Edit Succ. $\uparrow$ | | | | |
| | Portability $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |

As for the error you met with ROME, I will try to reproduce it and let you know.

Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

StarLooo commented 15 hours ago

You can check the code here: https://github.com/zjunlp/EasyEdit/blob/b6462bea7c1131591935ebe060fff5a2dd2683e5/easyeditor/evaluate/evaluate_utils.py#L80. We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.

I got it. With the default value of do_sample=False, EasyEdit uses greedy decoding by default, but as you mentioned, if the user's local generation_config.json file contains specific generation hyper-parameters, it may cause warnings or an unexpected generation strategy. Thanks for your timely reply!

StarLooo commented 15 hours ago

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency This metric evaluates whether the smoothness of the output has changed after the model performs an edit.
  2. Portability This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.
  3. Locality This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.
  4. Rewrite Accuracy (rewrite_acc) This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.

Thanks for your clarification! I previously mistakenly thought that rewrite_acc was an exact-match-based metric. By the way, why was token-level accuracy chosen instead of exact match when designing the Edit Succ. metric? Was it for a smoother measurement?

StarLooo commented 15 hours ago

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.

Thanks for your timely reply! So, is my understanding of the averaging method, and my report of the bug in the summary_metrics function along with the attempted modification, correct?

StarLooo commented 15 hours ago

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly. Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

Thanks for your continuous updating efforts; I'm looking forward to your new progress. I will also try to read through the relevant code and obtain the latest, correct results with the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier. Besides, can you further explain or point to which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in editing performance?

StarLooo commented 15 hours ago

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise. My WeChat ID: xi1786594371

Thanks for your enthusiasm. If further discussion is needed, I will contact you via WeChat.

StarLooo commented 14 hours ago

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just found a small bug in the eval function of the updated run_knowedit_llama2.py: before appending np.mean(case_list) to the Portability_list (and to the Locality_list, for robustness), we should first check whether case_list is empty; otherwise np.mean(case_list) will raise RuntimeWarning: Mean of empty slice and put nan into Portability_list for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.
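The guard I have in mind is roughly the following (variable names follow the script; the surrounding loop over edit samples is omitted, and the empty list here is just a stand-in):

```python
import numpy as np

Portability_list = []
case_list = []  # e.g. an edit sample with no portability evaluation samples

# Aggregate only when there is at least one sub-score; np.mean([]) emits
# "RuntimeWarning: Mean of empty slice" and would append nan otherwise.
if len(case_list) > 0:
    Portability_list.append(np.mean(case_list))
```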

littlefive5 commented 14 hours ago

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just found a small bug in the eval function of the updated run_knowedit_llama2.py: before appending np.mean(case_list) to the Portability_list (and to the Locality_list, for robustness), we should first check whether case_list is empty; otherwise np.mean(case_list) will raise RuntimeWarning: Mean of empty slice and put nan into Portability_list for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.

You're right, I have updated the code here.

littlefive5 commented 12 hours ago

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly. Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

Thanks for your continuous updating efforts; I'm looking forward to your new progress. I will also try to read through the relevant code and obtain the latest, correct results with the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier. Also, can you explain more about or point to which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in editing performance?

Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We have since updated it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.
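For readers following the thread, here is a rough self-contained sketch of an FT-M-style objective as described above (cross-entropy only over the target tokens, conditioned on the prompt); it is illustrative rather than copied from EasyEdit, and the model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def conditional_target_loss(model, tokenizer, prompt, target):
    """Cross-entropy over the target tokens only, conditioned on the prompt
    (prompt positions are masked out of the loss with -100)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt positions
    return model(input_ids=input_ids, labels=labels).loss

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

loss = conditional_target_loss(
    model, tokenizer, "The President of the United States is", " Trump Biden"
)
loss.backward()  # this loss would drive the FT / AdaLoRA parameter updates
```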