Open lhtpluto opened 10 months ago
2023-11-03 20:10:25,978 - WARNING - Loading data... Traceback (most recent call last): File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in trainer.train() File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train self.data_module = ChatDataModule( ^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init conversations = jload(data_path) ^^^^^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload jdict = json.load(f) ^^^^^^^^^^^^ File "D:\test\chatglm3-base-tuning-master\env\Lib\json__init__.py", line 293, in load return loads(fp.read(), ^^^^^^^^^ UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence
使用的formatted_samples.json
数据集编码格式不对
正确的编码格式是什么? formatted_samples.json是UTF-8的
utf-8
2023-11-03 20:10:25,978 - WARNING - Loading data... Traceback (most recent call last): File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in
trainer.train()
File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train
self.data_module = ChatDataModule(
^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init
self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init
conversations = jload(data_path)
^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload
jdict = json.load(f)
^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\env\Lib\json__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence
使用的formatted_samples.json