yao8839836 / kg-llm

Exploring large language models for knowledge graph completion

IndexError: piece id is out of range #3

Open Honourwei opened 1 year ago

Honourwei commented 1 year ago

Hi, thank you very much for your excellent work! I tried to run the code with ChatGLM2, but the following error occurs:

Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00, 1.29s/it]
[INFO|modeling_utils.py:3032] 2023-09-23 03:06:58,989 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[WARNING|modeling_utils.py:3034] 2023-09-23 03:06:58,989 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at chatglm2-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2690] 2023-09-23 03:06:58,993 >> Generation config file not found, using a generation config created from the model config.
Quantized to 4 bit
Running tokenizer on train dataset: 100%|██████████| 100/100 [00:00<00:00, 1127.52 examples/s]
input_ids [1266, 323, 267, 2482, 1074, 22011, 282, 293, 4434, 30987, 3485, 3331, 475, 3238, 428, 30954, 323, 1949, 332, 31007, 276, 11462, 290, 31007, 9026, 10380, 31007, 9026, 1111, 31007, 1011, 843, 356, 31007, 9026, 5736, 16763, 31007, 9026, 7542, 31007, 30929, 2915, 1753, 332, 31007, 536, 286, 291, 31007, 30921, 819, 291, 31007, 9026, 3560, 31007, 4762, 388, 31007, 30920, 1046, 291, 31007, 276, 17805, 289, 31007, 9026, 6200, 2432, 31007, 276, 3271, 291, 31007, 30919, 1211, 3339, 291, 31007, 9026, 2697, 3239, 31007, 22049, 31007, 9026, 1392, 10451, 31007, 261, 13292, 1883, 31007, 276, 16899, 290, 31007, 276, 5382, 289, 31007, 690, 30917, 31007, 4859, 6154, 428, 31007, 11054, 3361, 291, 31007, 276, 4565, 290, 31007, 914, 3915, 31007, 276, 3879, 291, 31007, 2265, 3838, 519, 291, 31007, 11565, 286, 31007, 326, 3915, 31007, 286, 885, 31007, 9026, 5090, 31007, 9026, 2133, 31007, 276, 5233, 289, 31007, 30420, 332, 30930, 150001, 150004, 323, 3271, 291, 150005, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Traceback (most recent call last):
  File "/root/KGcomplement/KGcomplement/kg-llm-main/ptuning_main.py", line 398, in <module>
    main()
  File "/root/KGcomplement/KGcomplement/kg-llm-main/ptuning_main.py", line 225, in main
    print_dataset_example(train_dataset[0])
  File "/root/KGcomplement/KGcomplement/kg-llm-main/ptuning_main.py", line 205, in print_dataset_example
    print("inputs", tokenizer.decode(example["input_ids"]))
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3476, in decode
    return self._decode(
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 931, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/root/anaconda3/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 912, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/tokenization_chatglm.py", line 125, in _convert_id_to_token
    return self.tokenizer.convert_id_to_token(index)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/tokenization_chatglm.py", line 60, in convert_id_to_token
    return self.sp_model.IdToPiece(index)
  File "/root/anaconda3/lib/python3.11/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File "/root/anaconda3/lib/python3.11/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
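For reference, the input_ids above contain 150001, 150004 and 150005, which look like the special-token ids (gMASK/sop/eop) of the original ChatGLM-6B release; ChatGLM2's SentencePiece model uses a much smaller vocabulary, so sp_model.IdToPiece raises on them. A minimal diagnostic sketch, assuming a local chatglm2-6b checkpoint as in the log (the sample ids are abbreviated from the output above):

```python
from transformers import AutoTokenizer

# Sketch only: find ids that fall outside the ChatGLM2 vocabulary before
# decoding. "chatglm2-6b" is the local checkpoint path from the log above.
tokenizer = AutoTokenizer.from_pretrained("chatglm2-6b", trust_remote_code=True)

# A shortened copy of the failing example's input_ids, keeping the suspects.
input_ids = [1266, 323, 267, 2482, 150001, 150004, 323, 3271, 291, 150005, 0]

bad = [i for i in input_ids if i >= tokenizer.vocab_size]
print("vocab size:", tokenizer.vocab_size)
print("out-of-range ids:", bad)  # presumably [150001, 150004, 150005]

# Workaround for the debug print only: skip the out-of-range ids.
print(tokenizer.decode([i for i in input_ids if i < tokenizer.vocab_size]))
```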

yao8839836 commented 1 year ago

@Honourwei Hi, thank you for your interest in our work.

I haven't tried ChatGLM2 with the current code. There seem to be small differences between the ChatGLM and ChatGLM2 fine-tuning code.

Running the official ChatGLM2 P-Tuning v2 (https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning) or LoRA fine-tuning (https://github.com/beyondguo/LLM-Tuning) on the same data may work; a rough conversion sketch follows below. I have tried the ChatGLM2 LoRA code on other tasks and it works well.
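If you go that route, the official p-tuning script reads JSON lines and takes the prompt/response keys via --prompt_column/--response_column. A rough conversion sketch; the source field names ("instruction"/"input"/"output") are my assumption about the kg-llm data files and may need adjusting:

```python
import json

# Rough sketch: convert instruction-style records into the JSON-lines layout
# the official ChatGLM2 p-tuning script consumes (the keys below are then
# passed as --prompt_column content --response_column summary). The source
# field names "instruction"/"input"/"output" are assumptions about the
# kg-llm data files.
with open("train.json", encoding="utf-8") as fin, \
        open("train_chatglm2.json", "w", encoding="utf-8") as fout:
    for line in fin:
        rec = json.loads(line)
        prompt = rec["instruction"] + rec.get("input", "")
        out = {"content": prompt, "summary": rec["output"]}
        fout.write(json.dumps(out, ensure_ascii=False) + "\n")
```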

Honourwei commented 1 year ago

Thanks for your reply. I have now run on the same data with the official ChatGLM2 p-tuning project (https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning) and it works. It seems there are indeed some differences between ChatGLM and ChatGLM2. Thank you again for your excellent work; we will cite it in follow-up work.
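For anyone else hitting this, the vocabulary difference is easy to see directly. A small comparison sketch; the model ids are the public Hugging Face checkpoints and the exact sizes depend on the checkpoint revision:

```python
from transformers import AutoTokenizer

# Sketch: the two models use differently sized vocabularies, which is why
# ids taken from ChatGLM's id space (e.g. 150001 in the log above) can be
# "out of range" for ChatGLM2. Swap in local paths if needed.
glm1 = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
glm2 = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

print("ChatGLM vocab size: ", glm1.vocab_size)
print("ChatGLM2 vocab size:", glm2.vocab_size)  # much smaller than ChatGLM's
```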

cjzhen9 commented 10 months ago

Could you provide the files for ChatGLM2 p-tuning? Thank you very much.