Open StarLooo opened 1 day ago
Thank you very much for your attention to EasyEdit. We will address your issue shortly.
- can not reproduce the KnowEdit results when using AdaLoRA or ROME

I tried to reproduce the KnowEdit results in your survey paper following your tutorial, with the same hyper-parameters and commands. I use Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it is very strange that I always get `post_rewrite_acc=1.0` on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I meet the same error as #290 (`RuntimeError: probability tensor contains either inf, nan or element < 0`) when editing Wikidata_recent, but not when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with `post_rewrite_acc=0.9857`. I'm not sure whether this is due to the `fp16` setting mentioned in #290, and I'll try again with `fp16=false`. If you need more details on the experiments, such as the results.json or log files, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!
Apologies for the issue regarding the results. EasyEdit has been under maintenance to improve its editing performance, and we have optimized the EasyEdit code, which has led to better results (possibly an improvement for AdaLoRA). Due to limited computational resources recently, we will update the paper as soon as possible (aiming for 1-2 weeks, and we will notify you). As for the other issues, we will address them as soon as possible.
Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:
- concern about the generation config

I noticed that in the EasyEdit code, the generation-related code does not explicitly specify the `generation_config` settings (including `do_sample`, `temperature`, `top_p`, etc.). This may result in the default generation method not using greedy decoding, potentially affecting the reproducibility of the results.
You can check the code here: https://github.com/zjunlp/EasyEdit/blob/b6462bea7c1131591935ebe060fff5a2dd2683e5/easyeditor/evaluate/evaluate_utils.py#L80. We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.
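One way to remove the ambiguity, in line with the plan to pin the parameters to greedy decoding, is to pass explicit settings to `generate` rather than relying on defaults. A minimal sketch (the `model.generate` call is commented out and hypothetical; `model` and `inputs` stand for a loaded Hugging Face model and tokenized batch):

```python
# Explicit settings that force greedy decoding, regardless of whatever a
# local generation_config.json might inject (illustrative sketch only).
greedy_kwargs = dict(
    do_sample=False,  # disable sampling entirely
    num_beams=1,      # with do_sample=False, this means pure greedy decoding
    temperature=1.0,  # neutralize any inherited sampling hyper-params
    top_p=1.0,
)

# outputs = model.generate(**inputs, max_new_tokens=100, **greedy_kwargs)
```

Passing these explicitly also silences the mismatch warnings that newer transformers versions emit when a local generation config carries sampling parameters while `do_sample=False`.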
- perhaps a bug in the summary_metrics function

According to the discussion in issue "the computing methods of Edit succ, Portability, Locality, Fluency" (#279), the summary method of `all_metrics` can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):

- for Edit Succ.: compute the average `rewrite_acc` of all edit samples.
- for Portability: first compute the average `Subject_Aliasing_acc`, `reasoning_acc`, and `Logical_Generalization_acc` of each edit sample; then average the three sub-scores to get the Portability of each edit sample; finally compute the average Portability over all edit samples.
- for Locality: first compute the average `Relation_Specificity_acc` and `Forgetfulness_acc` of each edit sample; then average the two sub-scores to get the Locality of each edit sample; finally compute the average Locality over all edit samples.
- for Fluency: compute the average `ngram_entropy` of all edit samples.

If the above summarization method is correct, the implementation of the `summary_metrics` function has a minor bug that causes `ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions.` This is due to this line of code: `mean_metrics[eval][key][lkey] = np.mean(metrics)`. When summarizing Portability, this line tries to compute `np.mean()` over a list of sub-lists of variable length (the number of Portability evaluation samples is not the same for each edit), which is not allowed. I think the correct modification is to change the original code

`metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()]`

into

`metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]`

The modified code first computes the average sub-score of each edit and then uses `mean_metrics[eval][key][lkey] = np.mean(metrics)` to summarize.

Besides, I find that the `summary_metrics` function does not summarize the Fluency metric, and the magnitude of `ngram_entropy` is too low compared to the reported Fluency metric (5.x~6.x vs ~500).
Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.
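As a toy illustration of the scale: a per-sample n-gram entropy is only a few bits, and multiplying by 100 lands in the reported ~500 range. This sketch is illustrative only and is not EasyEdit's exact Fluency formula, which combines several n-gram orders:

```python
import math
from collections import Counter

def ngram_entropy(tokens, n=2):
    # Shannon entropy (in bits) of the empirical n-gram distribution.
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tokens = "the cat sat on the mat and the dog sat on the rug".split()
raw = ngram_entropy(tokens)  # a few bits, same order as the 5.x~6.x values
reported = raw * 100         # scaled by 100 for presentation, ~500 scale
```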
- the meaning of pre-edit metrics

I'm uncertain about the meaning of the pre-edit metrics. According to my speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit is relatively more or less portable; and the pre-edit Locality is meaningless. However, I don't understand the meaning of pre-edit `rewrite_acc`: how to compute it, and why it can be greater than 0 before we conduct any editing.
There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

- Fluency: This metric evaluates whether the smoothness of the output has changed after the model performs an edit.
- Portability: This assesses the robustness of the model's knowledge after the edit, checking how well the model retains its overall stability.
- Locality: This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.
- Rewrite Accuracy (`rewrite_acc`): This measures the reliability of the edit.

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.
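The "Joe Biden" → "Trump Biden" example can be made concrete with a toy token-level accuracy. This is an illustrative sketch over word tokens, not EasyEdit's exact implementation, which operates on token ids:

```python
def token_accuracy(pred_tokens, target_tokens):
    # Fraction of target positions where the model's prediction matches.
    matches = sum(p == t for p, t in zip(pred_tokens, target_tokens))
    return matches / len(target_tokens)

# Before the edit, the model still answers "Joe Biden"; the new target
# is "Trump Biden". The shared token "Biden" matches, so the pre-edit
# accuracy is 0.5 rather than 0.
pre_edit_pred = ["Joe", "Biden"]
target_new = ["Trump", "Biden"]
acc = token_accuracy(pre_edit_pred, target_new)  # 0.5
```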
Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise. My WeChat ID: xi1786594371
Hello,

I have updated the `run_llama_knowedit.py` script to include the evaluation of results within the `eval` function for the KnowEdit framework.

Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you are currently getting are correct.

I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.
| DataSet | Metric | AdaLoRA | ROME | MEMIT | FT-L |
|---|---|---|---|---|---|
| WikiData_recent | Edit Succ. $\uparrow$ | | | | |
| | Portability $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |
| ZsRE | Edit Succ. $\uparrow$ | 100.00 | 97.22 | 95.33 | 53.93 |
| | Portability $\uparrow$ | 58.03 | 51.99 | 52.70 | 45.64 |
| | Locality $\uparrow$ | 35.16 | 65.66 | 63.49 | 38.07 |
| | Fluency $\uparrow$ | 548.08 | 571.60 | 563.31 | 492.77 |
| WikiBio | Edit Succ. $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |
| WikiData_counterfact | Edit Succ. $\uparrow$ | | | | |
| | Portability $\uparrow$ | | | | |
| | Locality $\uparrow$ | | | | |
| | Fluency $\uparrow$ | | | | |
As to the error you met in ROME, I will try to reproduce it and will let you know.
Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).
I got it. With the default value of `do_sample=False`, EasyEdit uses greedy decoding by default; but as you mentioned, if the user's local generation_config.json file contains some specific generation hyper-params, it may cause warnings or an unexpected generation strategy. Thanks for your timely reply!
Thanks for your clarification! I previously mistakenly thought that `rewrite_acc` is a metric based on exact match. By the way, why was token-level accuracy chosen instead of exact match when designing the Edit Succ. metric? Was it for a smoother measurement?
Thanks for your timely reply! So, is my understanding of the averaging method, the bug report in the `summary_metrics` function, and my attempted modification correct?
Thanks for your continuous efforts in updating, and I'm looking forward to your new progress. I will also try to read through the relevant code and get the latest, correct results based on the current EasyEdit. I just took a quick look at the updated `run_knowedit_llama2.py`, and the `eval` function seems to be largely consistent with the modified `summary_metrics` method I mentioned earlier.

Besides, can you further explain or locate which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in edit performance?
Thanks for your enthusiasm. If further discussion is needed, I will contact you via WeChat.
I just found a little bug in the `eval` function in the updated `run_knowedit_llama2.py`: before appending `np.mean(case_list)` to `Portability_list` (and to `Locality_list`, for robustness), we should first check whether `case_list` is empty; otherwise `np.mean(case_list)` will cause `RuntimeWarning: Mean of empty slice` and produce `nan` in `Portability_list` for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.
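The guard described above can be sketched as follows (hypothetical helper name; it mirrors the check, not the repository's exact code):

```python
import numpy as np

def append_mean_if_any(case_list, score_list):
    # Only average when the edit actually has evaluation samples;
    # np.mean([]) emits "RuntimeWarning: Mean of empty slice" and
    # yields nan, which would poison the final average.
    if len(case_list) > 0:
        score_list.append(float(np.mean(case_list)))

Portability_list = []
append_mean_if_any([1.0, 0.0], Portability_list)  # appended: 0.5
append_mean_if_any([], Portability_list)          # skipped: no samples
```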
You're right, I have updated the code here.
Hello, the main update in the code is the loss computation. We previously followed ROME's code, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We have since updated it to use the conditional output to compute the loss (FT-M). We changed the loss computation for both the FT and AdaLoRA methods.
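The FT-M style (loss on the target tokens, conditioned on the full prompt) can be illustrated with the common label-masking convention, where positions labeled -100 are ignored by the cross-entropy loss. This is a toy sketch with made-up token ids, not the repository's code:

```python
IGNORE_INDEX = -100  # Hugging Face convention: ignored by cross-entropy

prompt_ids = [5, 6, 7, 8]  # e.g. tokenized "The president of the US is"
target_ids = [42, 43]      # e.g. tokenized " Trump Biden"

input_ids = prompt_ids + target_ids
# FT-M style: mask out the prompt positions so the loss is computed only
# on the target tokens, conditioned on the full prompt.
labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
```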