ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 large model project (phase 2), with 64K long-context models
Apache License 2.0

Model training always fails with the error below; could you please take a look? (A newbie coming over from Golang) #433

Closed Mr1994 closed 8 months ago

Mr1994 commented 9 months ago

Checklist required before submitting

Issue type

Model training and fine-tuning

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

```
bash training/run_pt.sh
[2023-12-01 19:09:27,828] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-01 19:09:30,011] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-01 19:09:30,011] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

12/01/2023 19:11:40 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:715] 2023-12-01 19:11:40,996 >> loading configuration file /llm/llama.cpp/models/chinese-alpaca-2-7b-hf/config.json
[INFO|configuration_utils.py:777] 2023-12-01 19:11:40,997 >> Model config LlamaConfig {
  "_name_or_path": "/llm/llama.cpp/models/chinese-alpaca-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 55296
}

[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2020] 2023-12-01 19:11:40,998 >> loading file tokenizer.json
[WARNING|logging.py:329] 2023-12-01 19:11:40,998 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
12/01/2023 19:11:41 - INFO - __main__ - training datasets-test has been loaded from disk
Traceback (most recent call last):
  File "/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
    main()
  File "/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 549, in main
    lm_datasets = lm_datasets.train_test_split(test_size = data_args.validation_split_percentage)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 4437, in train_test_split
    raise ValueError(
ValueError: With n_samples=1, test_size=0.001 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40985) of binary: /home/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/llm/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-01_19:11:44
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 40985)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

### Dependencies (must be provided for code-related issues)

# Below is my run_pt.sh configuration
pretrained_model=/llm/llama.cpp/models/chinese-alpaca-2-7b-hf         # where my model is stored; I downloaded the 7B base model
chinese_tokenizer_path=/llm/Chinese-LLaMA-Alpaca-2/scripts/tokenizer  # the tokenizer directory from this repo
dataset_dir=/llm/Chinese-LLaMA-Alpaca-2/dataset                       # the data I want to train on; its content is: "您知道孙中山先生吗:他是世界上最伟大的人" ("Do you know Dr. Sun Yat-sen? He is the greatest man in the world")
data_cache=/llm/Chinese-LLaMA-Alpaca-2/temp_data_cache_dir            # the data cache directory
per_device_train_batch_size=1
gradient_accumulation_steps=8
block_size=512
output_dir=output_dir
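(Editor's note: for context, here is a minimal sketch of why such a tiny dataset ends up as a single training sample. This assumes the script follows the usual Hugging Face `group_texts` pattern used by run_clm-style pretraining scripts, which `run_clm_pt_with_peft.py` is derived from; the function below is a simplified illustration, not the repository's exact code.)

```python
from itertools import chain

# Sketch of the group_texts step in run_clm-style pretraining scripts
# (simplified; block_size=512 matches the run_pt.sh setting above).
def group_texts(examples, block_size=512):
    # Concatenate every tokenized field into one long sequence.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder only when there is at least one full block,
    # so a single short sentence survives as one (short) sample.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Cut into block_size chunks -> one chunk per training sample.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```

With only one short sentence in `dataset_dir`, this grouping yields a single sample, which is exactly the `n_samples=1` that later fails in `train_test_split`.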



### Run logs or screenshots

![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/assets/15978040/293587d8-5c51-4355-8e02-d1dfb25915bc)
iMountTai commented 9 months ago

The screenshot is broken. But judging from the log you posted, the number of samples is too small, so when the data is split into training and test sets the training set ends up empty. Try increasing the number of samples.
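(Editor's note: a small, hypothetical reproduction with the `datasets` library that illustrates the suggestion above; the toy `input_ids` values are made up. With a single sample, `train_test_split` raises exactly the ValueError from the log, while a larger dataset splits fine.)

```python
from datasets import Dataset

# One sample, as in the failing run: nothing is left for the train split.
tiny = Dataset.from_dict({"input_ids": [[1, 2, 3]]})
try:
    tiny.train_test_split(test_size=0.001)
except ValueError as e:
    print(e)  # "... the resulting train set will be empty ..."

# With more samples the same split succeeds.
bigger = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 1000})
splits = bigger.train_test_split(test_size=0.001)
print({k: len(v) for k, v in splits.items()})
```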

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 8 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.