ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 large model project (phase 2), with 64K long-context models
Apache License 2.0

Model training always fails with the error below; could you please take a look? (A newbie coming over from Golang) #433

Closed Mr1994 closed 8 months ago

Mr1994 commented 9 months ago

Checklist required before submitting

Issue type

Model training and fine-tuning

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

```
bash training/run_pt.sh
[2023-12-01 19:09:27,828] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-01 19:09:30,011] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-01 19:09:30,011] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

12/01/2023 19:11:40 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:715] 2023-12-01 19:11:40,996 >> loading configuration file /llm/llama.cpp/models/chinese-alpaca-2-7b-hf/config.json
[INFO|configuration_utils.py:777] 2023-12-01 19:11:40,997 >> Model config LlamaConfig {
  "_name_or_path": "/llm/llama.cpp/models/chinese-alpaca-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 55296
}

[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer.json
[WARNING|logging.py:329] 2023-12-01 19:11:40,998 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
12/01/2023 19:11:41 - INFO - __main__ - training datasets-test has been loaded from disk
Traceback (most recent call last):
  File "/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
    main()
  File "/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 549, in main
    lm_datasets = lm_datasets.train_test_split(test_size = data_args.validation_split_percentage)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 4437, in train_test_split
    raise ValueError(
ValueError: With n_samples=1, test_size=0.001 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40985) of binary: /home/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-01_19:11:44
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 40985)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

### Dependencies (must be provided for code-related issues)

# Below is my run_pt.sh configuration
pretrained_model=/llm/llama.cpp/models/chinese-alpaca-2-7b-hf         # where my model is stored; I downloaded the 7B base model
chinese_tokenizer_path=/llm/Chinese-LLaMA-Alpaca-2/scripts/tokenizer  # the tokenizer directory from this repo
dataset_dir=/llm/Chinese-LLaMA-Alpaca-2/dataset                       # the data I want to train on; its content is: "您知道孙中山先生吗:他是世界上最伟大的人" ("Do you know Dr. Sun Yat-sen? He is the greatest man in the world")
data_cache=/llm/Chinese-LLaMA-Alpaca-2/temp_data_cache_dir            # the data cache directory
per_device_train_batch_size=1
gradient_accumulation_steps=8
block_size=512
output_dir=output_dir
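(Editor's note: for context, here is a minimal sketch of why such a tiny dataset ends up as a single training sample. This assumes the script follows the usual Hugging Face `group_texts` pattern used by run_clm-style pretraining scripts, which `run_clm_pt_with_peft.py` is derived from; the function below is a simplified illustration, not the repository's exact code.)

```python
from itertools import chain

# Sketch of the group_texts step in run_clm-style pretraining scripts
# (simplified; block_size=512 matches the run_pt.sh setting above).
def group_texts(examples, block_size=512):
    # Concatenate every tokenized field into one long sequence.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder only when there is at least one full block,
    # so a single short sentence survives as one (short) sample.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Cut into block_size chunks -> one chunk per training sample.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```

With only one short sentence in `dataset_dir`, this grouping yields a single sample, which is exactly the `n_samples=1` that later fails in `train_test_split`.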



### Run logs or screenshots

![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/assets/15978040/293587d8-5c51-4355-8e02-d1dfb25915bc)
iMountTai commented 9 months ago

The screenshot is broken. But judging from the log you posted, the number of samples is too small, so when the data is split into training and test sets the training set ends up empty. Try increasing the number of samples.
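(Editor's note: a small, hypothetical reproduction with the `datasets` library that illustrates the suggestion above; the toy `input_ids` values are made up. With a single sample, `train_test_split` raises exactly the ValueError from the log, while a larger dataset splits fine.)

```python
from datasets import Dataset

# One sample, as in the failing run: nothing is left for the train split.
tiny = Dataset.from_dict({"input_ids": [[1, 2, 3]]})
try:
    tiny.train_test_split(test_size=0.001)
except ValueError as e:
    print(e)  # "... the resulting train set will be empty ..."

# With more samples the same split succeeds.
bigger = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 1000})
splits = bigger.train_test_split(test_size=0.001)
print({k: len(v) for k, v in splits.items()})
```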

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 8 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.