zjunlp / EasyEdit

[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
https://zjunlp.github.io/project/KnowEdit

Problems reproducing MEND's n-gram entropy result with gpt-j-6B on the CounterFact dataset #256

Closed: jiqimaoke closed this issue 2 months ago

jiqimaoke commented 3 months ago

I tried to reproduce the MEND results on gpt-j-6B and Llama-2-7b, but the n-gram entropy for gpt-j-6B is far below that of Llama-2-7b (around 350 for gpt-j-6B vs. around 550 for Llama-2-7b). Do you have any ideas?
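
(For reference, by n-gram entropy I mean the fluency metric computed over the model's generated text. Below is my own rough sketch of that kind of metric, assuming whitespace tokenization and an unweighted mean over bi- and tri-gram entropies; EasyEdit/ROME's actual tokenization, weighting, and scaling may differ, so the absolute numbers above will not match this toy version.)

from collections import Counter

import numpy as np

def n_gram_entropy(text: str, ns=(2, 3)) -> float:
    """Mean Shannon entropy (in bits) of the n-gram frequency distributions of `text`."""
    tokens = text.split()  # simplification: plain whitespace tokenization
    entropies = []
    for n in ns:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts = np.array(list(Counter(grams).values()), dtype=float)
        probs = counts / counts.sum()
        entropies.append(float(-(probs * np.log2(probs)).sum()))
    return float(np.mean(entropies))

# A fluent generation has many distinct n-grams and therefore a relatively high entropy.
print(n_gram_entropy("The mother tongue of Danielle Darrieux is French , which she spoke at home ."))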

Here is my training code:

from easyeditor import EditTrainer, MENDTrainingHparams, CounterFactDataset

# Load the MEND training hyperparameters for GPT-J-6B (yaml shown below)
training_hparams = MENDTrainingHparams.from_hparams('./hparams/TRAINING/MEND/gpt-j-6B.yaml')

# CounterFact train/validation splits
train_ds = CounterFactDataset('data/counterfact/counterfact-train-filtered.json', config=training_hparams)
eval_ds = CounterFactDataset('data/counterfact/counterfact-val.json', config=training_hparams)

# Train the MEND hypernetwork
trainer = EditTrainer(
    config=training_hparams,
    train_set=train_ds,
    val_set=eval_ds
)

trainer.run()

My training yaml:

# Model
model_name: ./hf_models/gpt-j-6b
model_class: GPTJForCausalLM
tokenizer_class: AutoTokenizer
tokenizer_name: ./hf_models/gpt-j-6b
model_parallel: False
inner_params:
- transformer.h.25.mlp.fc_in.weight
- transformer.h.25.mlp.fc_out.weight
- transformer.h.26.mlp.fc_in.weight
- transformer.h.26.mlp.fc_out.weight
- transformer.h.27.mlp.fc_in.weight
- transformer.h.27.mlp.fc_out.weight

archive: null

# Method
alg: MEND
lr: 1e-6
edit_lr: 1e-4
lr_lr: 1e-4
seed: 42
cedit: 0.1
cloc: 1.0
cbase: 1.0
dropout: 0.0
train_base: False
no_grad_layers: null
one_sided: False
n_hidden: 1
hidden_dim: null
init: id
norm: True
combine: True
x_only: False
delta_only: False
act: relu
rank: 1920
mlp_class: IDMLP
shared: True

# Train
device: cuda:2
batch_size: 1
model_save_pt: 5000
silent: False
#max_epochs: 1
max_iters: 100000
log_interval: 1000
eval_log_interval: 1000
final_eval: True
val_interval: 1000
early_stop_patience: 20000
# early_stop_patience: 30000
early_stop_key: "loss/total_edit_val"
# early_stop_key: "edit/acc_val"
eval_only: False
half: False
debug: False
save: False
verbose: True

val_batch_size: 5
accumulate_bs: 10
val_steps: 500 # only for debug
opt: Adam
grad_clip: 100.

# Output

results_dir: ./results

My eval script:

python run_knowedit_llama2.py \
    --editing_method=MEND \
    --hparams_dir=./hparams/MEND/gpt-j-6B.yaml \
    --data_dir=./data/counterfact/merged_v2.1_new_format.json \
    --datatype='counterfact'
jiqimaoke commented 3 months ago

After replacing MEND's key files with the corresponding files from the rome repository, the results returned to normal.

XeeKee commented 3 months ago

Hello, we apologize for any inconvenience. We are currently busy with a paper deadline and have not had enough manpower over the past few days; we will address this in a few days. In the meantime, you could try adjusting the hyperparameters for GPT-J-6B, as our yaml file is not optimized for that model.

pengzju commented 3 months ago

Additionally, have you reproduced the n-gram entropy on llama-2-7B? It is possible that MEND itself causes repeated tokens (like We We We We We), leading to reduced diversity.
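
As a toy illustration (my own simplified bigram-entropy calculation, not EasyEdit's metric), that kind of degenerate repetition drives the n-gram entropy toward zero:

from collections import Counter
from math import log2

def bigram_entropy(text):
    # Shannon entropy (in bits) of the bigram distribution over whitespace tokens
    tokens = text.split()
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

print(bigram_entropy("We We We We We We We We"))                     # 0.0: a single repeated bigram
print(bigram_entropy("The capital of the state is a large city ."))  # ~3.2: all bigrams distinct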

jiqimaoke commented 3 months ago
  • Could you please specify which scripts you mean by "key files"?

I list below the key files I replaced previously.

You can find the corresponding files in the following folders.

jiqimaoke commented 3 months ago

Additionally, have you reproduced the n-gram entropy on llama-2-7B? It is possible that MEND itself causes repeated tokens (like We We We We We), leading to reduced diversity.

I have encountered this situation, but it is a rare case; the n-gram entropy of Llama-2-7b is still between 550 and 600. Based on the n-gram entropy results reported in other papers, it may just be an anomalous case.

pengzju commented 3 months ago

I think I understand your issue. I'll reproduce it on GPT-J as soon as possible.

Currently, my guess is that the cause might be too many iterations, leading the MEND hypernetwork to overfit the token-level cross-entropy (resulting in very high probabilities for certain token IDs), which degrades generation quality. You could try setting max_iters to 20000. I'm also following up on this issue.
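
In the training yaml above, that would just be the following change (other hyperparameters unchanged; treat it as a starting point to experiment with rather than a tuned value):

max_iters: 20000  # reduced from 100000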

xzwyyd commented 2 months ago

We tried setting max_iters to 20000, and the average n-gram entropy of gpt-j-6b is around 450.

zxlzr commented 2 months ago

Hi, do you have any further questions?

jiqimaoke commented 2 months ago

No further questions. Thx!