xxw1995 / chatglm3-finetune

最容易上手的0门槛 chatglm3 & agent & langchain 项目
244 stars 36 forks source link

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence #11

Open lhtpluto opened 10 months ago

lhtpluto commented 10 months ago

2023-11-03 20:10:25,978 - WARNING - Loading data... Traceback (most recent call last): File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in trainer.train() File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train self.data_module = ChatDataModule( ^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init conversations = jload(data_path) ^^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload jdict = json.load(f) ^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\env\Lib\json__init__.py", line 293, in load return loads(fp.read(), ^^^^^^^^^ UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

使用的formatted_samples.json

xxw1995 commented 10 months ago

数据集编码格式不对

lhtpluto commented 10 months ago

数据集编码格式不对

正确的编码格式是什么? formatted_samples.json是UTF-8的

hottestme commented 1 month ago

utf-8