zjunlp / EasyEdit

[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
https://zjunlp.github.io/project/KnowEdit
MIT License

WISE tokenize #330

Closed SXxinxiaosong closed 1 month ago

SXxinxiaosong commented 1 month ago

Hello, I have a question about tokenize() in WISE. In tokenize() in WISE's utils.py,

prompt_ids = tokenizer([f"{templ.format(p)}" for p in prompt for templ in context_templates], return_tensors="pt", padding=True, truncation=True)["input_ids"]

applies padding, and for LLaMA padding_side=right. The subsequent operations are:

num_prompt_toks = [len(i) for i in prompt_ids]  # row lengths of the *padded* prompt batch
tokens = tokenizer(full_prompt, return_tensors="pt", padding=True, truncation=True)
tokens["labels"] = tokens["input_ids"].clone()
if hparams.objective_optimization == 'only_label':
    for i in range(len(num_prompt_toks)):
        tokens["labels"][i][:num_prompt_toks[i]] = mask_token  # mask out the prompt positions

Since prompt_ids is a padded batch, len(i) is the padded max length for every row, so num_prompt_toks[i] over-counts the true prompt length of any shorter prompt. The step tokens["labels"][i][:num_prompt_toks[i]] = mask_token can therefore set token ids that belong to target_new to -100.
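A minimal sketch of this failure mode (hypothetical token ids, no real tokenizer involved):

PAD, MASK = 0, -100

# Two prompts of different lengths, right-padded to the same width.
prompt_ids = [
    [11, 12, 13, 14, 15],     # 5 real prompt tokens
    [21, 22, PAD, PAD, PAD],  # 2 real prompt tokens + 3 pads
]
num_prompt_toks = [len(i) for i in prompt_ids]  # [5, 5]: both rows report the padded length

# full_prompt = prompt + target_new, also right-padded.
full_ids = [
    [11, 12, 13, 14, 15, 91, 92],   # target_new = [91, 92]
    [21, 22, 81, 82, 83, 84, PAD],  # target_new = [81, 82, 83, 84]
]
labels = [row[:] for row in full_ids]
for i in range(len(num_prompt_toks)):
    labels[i][:num_prompt_toks[i]] = [MASK] * num_prompt_toks[i]

print(labels[1])  # [-100, -100, -100, -100, -100, 84, 0]
# The first three target_new tokens (81, 82, 83) were masked away.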

pengzju commented 1 month ago

Hi, padding_side has always been left.

pengzju commented 1 month ago

You can see in editor.py where padding_side is assigned.

SXxinxiaosong commented 1 month ago

hparams:

alg_name: "WISE"
model_name: "/home/xsong/llama/llama-2-7b-chat"
device: 3

mask_ratio: 0.2
edit_lr: 1.0
n_iter: 70
norm_constraint: 1.0
act_margin: [5.0, 20.0, 10.0] # alpha, beta, gamma
act_ratio: 0.88
save_freq: 500
merge_freq: 1000
merge_alg: 'ties'
objective_optimization: 'only_label'
inner_params:
- model.layers[27].mlp.down_proj.weight

## alternative: WISE-Merge, WISE-Retrieve

# for merge (if merge)
densities: 0.53
weights: 1.0

# for retrieve (if retrieve, pls set to True)
retrieve: True
replay: False # True --> will replay the past editing instances: see https://arxiv.org/abs/2405.14768 Appendix B.3
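
For context, a minimal sketch of how a config like this is typically consumed, assuming EasyEdit's usual BaseEditor / WISEHyperParams API (the YAML path, prompts, and loc_prompts below are hypothetical):

from easyeditor import BaseEditor, WISEHyperParams

# Hypothetical path to a YAML file holding the hparams above.
hparams = WISEHyperParams.from_hparams('./hparams/WISE/llama-7b-chat.yaml')
editor = BaseEditor.from_hparams(hparams)

metrics, edited_model, _ = editor.edit(
    prompts=['Who is the architect of the Eiffel Tower?'],
    target_new=['Gustave Eiffel'],
    loc_prompts=['nq question: where was the last episode filmed'],  # unrelated text for WISE's locality objective
)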
The relevant logic in editor.py:

elif 'llama' in self.model_name.lower():
    self.model = AutoModelForCausalLM.from_pretrained(self.model_name, torch_dtype=torch_dtype, device_map=device_map)
    self.tok = AutoTokenizer.from_pretrained(self.model_name)
    self.tok.pad_token_id = self.tok.eos_token_id

if self.tok is not None and (isinstance(self.tok, GPT2Tokenizer) or isinstance(self.tok, GPT2TokenizerFast) or isinstance(self.tok, LlamaTokenizer)) and (hparams.alg_name not in ['ROME', 'MEMIT']):
    LOG.info('AutoRegressive Model detected, set the padding side of Tokenizer to left...')
    self.tok.padding_side = 'left'

if self.tok is not None and ('mistral' in self.model_name.lower() or 'llama' in self.model_name.lower() or 'qwen' in self.model_name.lower()) and (hparams.alg_name in ['ROME', 'MEMIT']):
    LOG.info('AutoRegressive Model detected, set the padding side of Tokenizer to right...')
    self.tok.padding_side = 'right'

print(self.tok.padding_side)

Walking through this logic, padding_side ends up as right.

SXxinxiaosong commented 1 month ago

It seems self.tok should be loaded with LlamaTokenizer (the slow tokenizer) instead.

pengzju commented 1 month ago

It seems self.tok should be loaded with LlamaTokenizer (the slow tokenizer) instead.

Makes sense. Could you check what type tok is after AutoTokenizer.from_pretrained(self.model_name)? I have always used llama-base rather than chat, so this sounds like a minor issue. If it's convenient, please open a PR.

SXxinxiaosong commented 1 month ago

After AutoTokenizer.from_pretrained(self.model_name), tok is a LlamaTokenizerFast.
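That matches the editor.py check above: the fast tokenizer class does not subclass the slow one, so the isinstance check fails (a quick demonstration, using the model path from the hparams in this thread):

from transformers import AutoTokenizer, LlamaTokenizer

tok = AutoTokenizer.from_pretrained("/home/xsong/llama/llama-2-7b-chat")
print(type(tok).__name__)               # LlamaTokenizerFast
print(isinstance(tok, LlamaTokenizer))  # False -> the left-padding branch is skipped,
                                        #          so padding_side keeps its default, 'right'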

pengzju commented 1 month ago

I've changed it to use_fast=False, which should resolve your problem. Thanks for the suggestion; I'll close this issue.
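
As a sketch of the fix (presumably applied to the AutoTokenizer call in editor.py shown earlier):

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)
# use_fast=False returns the slow LlamaTokenizer, so the
# isinstance(self.tok, LlamaTokenizer) check now passes and
# padding_side is set to 'left' for WISE.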