ymcui / Chinese-LLaMA-Alpaca-3

Chinese LLaMA & Alpaca large model project, phase 3 (Chinese Llama-3 LLMs), developed from Meta Llama 3
Apache License 2.0

Dataset build failure when starting training #82

Closed hk63560892 closed 1 month ago

hk63560892 commented 1 month ago

The following items must be checked before submission

Issue type

Model training and fine-tuning

Base model

Llama-3-Chinese-8B-Instruct (instruction model)

Operating system

Linux

Detailed description of the issue

Hello, the problem seems to be with the dataset, but I am using the default dataset.

FileNotFoundError: Directory /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512 is neither a `Dataset` directory nor a `DatasetDict` directory.

pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0
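
The Arrow error typically means the JSON records are not uniform: a field that pyarrow inferred as an object (or scalar) appears as an array somewhere, here already in row 0. A minimal diagnostic sketch, not part of the repository, that flags such fields (the file name is taken from the log below; the flat record layout is an assumption based on the project's Alpaca-style instruction data):

```python
# Diagnostic sketch: verify that the training JSON is a top-level array of
# flat objects with no nested arrays/objects, which is what the pyarrow JSON
# reader expects here.
import json

path = "data/ruozhiba_qa2449_gpt4o.json"  # file name taken from the log

with open(path, encoding="utf-8") as f:
    records = json.load(f)

assert isinstance(records, list), "top level should be a JSON array of objects"
for i, rec in enumerate(records):
    if not isinstance(rec, dict):
        print(f"row {i}: expected an object, got {type(rec).__name__}")
        continue
    for key, value in rec.items():
        if isinstance(value, (list, dict)):
            # Inconsistent or nested fields are what trigger
            # "Column() changed from object to array".
            print(f"row {i}, field {key!r}: nested {type(value).__name__}")
```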

Dependencies (must be provided for code-related issues)

bitsandbytes             0.43.1
peft                     0.7.1
sentencepiece            0.1.97
torch                    2.3.1
transformers             4.42.3

Run logs or screenshots

FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
07/01/2024 08:54:55 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:731] 2024-07-01 08:54:55,574 >> loading configuration file /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/merged_model/config.json
[INFO|configuration_utils.py:800] 2024-07-01 08:54:55,575 >> Model config LlamaConfig {
  "_name_or_path": "/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/merged_model",
  "architectures": [
    "LlamaModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2159] 2024-07-01 08:54:55,575 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2159] 2024-07-01 08:54:55,575 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-01 08:54:55,575 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-01 08:54:55,575 >> loading file tokenizer_config.json
[WARNING|logging.py:313] 2024-07-01 08:54:55,740 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/01/2024 08:54:55 - INFO - __main__ - Training files: /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o.json /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4t.json
07/01/2024 08:54:55 - WARNING - root - building dataset...
Using custom data configuration default-60c211afdaa9cc21
07/01/2024 08:54:56 - INFO - datasets.builder - Using custom data configuration default-60c211afdaa9cc21
Loading Dataset Infos from /home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/packaged_modules/json
07/01/2024 08:54:56 - INFO - datasets.info - Loading Dataset Infos from /home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating dataset json (/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512/json/default-60c211afdaa9cc21/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a)
07/01/2024 08:54:56 - INFO - datasets.builder - Generating dataset json (/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512/json/default-60c211afdaa9cc21/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a)
Downloading and preparing dataset json/default to /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512/json/default-60c211afdaa9cc21/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a...
07/01/2024 08:54:56 - INFO - datasets.builder - Downloading and preparing dataset json/default to /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512/json/default-60c211afdaa9cc21/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a...
Downloading took 0.0 min
07/01/2024 08:54:56 - INFO - datasets.download.download_manager - Downloading took 0.0 min
Checksum Computation took 0.0 min
07/01/2024 08:54:56 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Generating train split
07/01/2024 08:54:56 - INFO - datasets.builder - Generating train split
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/scripts/training/build_dataset.py", line 65, in build_instruction_dataset
    processed_dataset = datasets.load_from_disk(cache_path)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/load.py", line 2704, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/data/ruozhiba_qa2449_gpt4o_512 is neither a `Dataset` directory nor a `DatasetDict` directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 130, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/builder.py", line 1997, in _prepare_split_single
    for _, table in generator:
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
    df = pd.read_json(f, dtype_backend="pyarrow")
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1025, in read
    obj = self._get_object_parser(self.data)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1402, in _parse
    self.obj = DataFrame(
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/frame.py", line 851, in __init__
    arrays, columns, index = nested_data_to_arrays(
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 520, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 837, in to_arrays
    arr, columns = _list_of_dict_to_arrays(data, columns)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 918, in _list_of_dict_to_arrays
    columns = ensure_index(pre_cols)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 7647, in ensure_index
    return Index(index_like, copy=copy, tupleize_cols=False)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 565, in __new__
    arr = sanitize_array(data, None, dtype=dtype, copy=copy)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/construction.py", line 654, in sanitize_array
    subarr = maybe_convert_platform(data)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/pandas/core/dtypes/cast.py", line 139, in maybe_convert_platform
    arr = lib.maybe_convert_objects(arr)
  File "lib.pyx", line 2538, in pandas._libs.lib.maybe_convert_objects
TypeError: Cannot convert numpy.ndarray to numpy.ndarray

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 436, in <module>
    main()
  File "/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 297, in main
    train_dataset = build_instruction_dataset(
  File "/home/datascienceroot/hk63560892/Chinese-LLaMA-Alpaca-3/scripts/training/build_dataset.py", line 68, in build_instruction_dataset
    raw_dataset = load_dataset("json", data_files=file, cache_dir=cache_path)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/load.py", line 2616, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/builder.py", line 1884, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/datascienceroot/miniconda3/envs/LLM_model/lib/python3.10/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
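
From the traceback: `build_instruction_dataset` first tries `datasets.load_from_disk` on the cache directory (hence the expected `FileNotFoundError` above) and then falls back to `load_dataset("json", ...)`; the pyarrow parse error surfaces in that fallback, so it is the JSON file itself that fails to parse. A minimal sketch to reproduce the failure outside the training script, assuming the same file path as in the log and the Alpaca-style record layout used by the project's default data:

```python
# Reproduction sketch: load the file the same way build_dataset.py does.
# If this raises DatasetGenerationError, the JSON file is malformed
# independently of the training code.
from datasets import load_dataset

data_file = "data/ruozhiba_qa2449_gpt4o.json"  # path taken from the log
raw_dataset = load_dataset("json", data_files=data_file)
print(raw_dataset)

# A record layout that parses cleanly (field names are an assumption,
# following the common Alpaca-style instruction format):
# [
#   {"instruction": "...", "input": "", "output": "..."},
#   ...
# ]
```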
github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 month ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.