ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models
Apache License 2.0

Is this error during training caused by a problem with my txt file? #438

Closed · Mr1994 closed this issue 8 months ago

Mr1994 commented 9 months ago

Pre-submission checklist completed

Issue type

Model training and fine-tuning

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

bash scripts/training/run_pt.sh
[2023-12-05 11:53:17,052] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-05 11:53:19,238] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-05 11:53:19,238] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
12/05/2023 11:55:30 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:715] 2023-12-05 11:55:30,062 >> loading configuration file /llma2/llama.cpp/models/chinese-alpaca-2-7b-hf/config.json
[INFO|configuration_utils.py:777] 2023-12-05 11:55:30,063 >> Model config LlamaConfig {
  "_name_or_path": "/llma2/llama.cpp/models/chinese-alpaca-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 55296
}

[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,064 >> loading file tokenizer.json
[WARNING|logging.py:329] 2023-12-05 11:55:30,064 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/llma2/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
    main()
  File "/llma2/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 549, in main
    lm_datasets = lm_datasets.train_test_split(test_size = data_args.validation_split_percentage)
AttributeError: 'list' object has no attribute 'train_test_split'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39701) of binary: /llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/bin/python
Traceback (most recent call last):
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/training/run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-05_11:55:33
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 39701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
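
For context: `train_test_split` is a method of `datasets.Dataset`, not of a plain Python list. The pretraining script builds one dataset per input txt file and concatenates them; if no file is processed successfully (an empty data directory, unreadable files, or a stale tokenization cache), `lm_datasets` can remain a plain list and fail exactly as above. Below is a minimal sketch of the expected flow, assuming the Hugging Face `datasets` library; the file names are hypothetical, not from the project:

```python
# Minimal sketch (not the project script): how train_test_split is
# normally reached. File names here are hypothetical.
from datasets import load_dataset, concatenate_datasets

# Each plain-text file becomes a Dataset with a single "text" column.
parts = [
    load_dataset("text", data_files=f)["train"]
    for f in ["corpus_a.txt", "corpus_b.txt"]  # hypothetical inputs
]

lm_datasets = concatenate_datasets(parts)  # a datasets.Dataset, not a list

# This only works on a Dataset; calling it on a plain Python list raises
# the AttributeError seen in the log above.
lm_datasets = lm_datasets.train_test_split(test_size=0.05)
print(lm_datasets["train"], lm_datasets["test"])
```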

### Dependencies (must be provided for code-related issues)

Please paste your dependency information here (paste it inside this code block)


### Run logs or screenshots

(screenshot attached)

iMountTai commented 9 months ago

What format is your txt file? From the error, all I can see is that the `lm_datasets` you generated has the wrong type; it is not clear whether the problem is a bad generated cache or bad input data.
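
One way to narrow this down is to load a single txt file the same way the script does and inspect the result. This is a hypothetical debugging snippet with a placeholder path, not part of the project:

```python
from datasets import load_dataset

# Hypothetical path: point this at one of your training txt files.
raw = load_dataset("text", data_files="/path/to/train_data.txt")["train"]

print(type(raw))     # expect <class 'datasets.arrow_dataset.Dataset'>
print(raw.num_rows)  # 0 rows would also break downstream processing
if raw.num_rows:
    print(raw[0]["text"][:80])  # peek at the first line
```

If the file loads cleanly, clearing the dataset cache directory used by the script and re-running is a reasonable next step, since a cache left behind by an earlier failed run can also produce the wrong type.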

iMountTai commented 9 months ago

Didn't you already get the code running in issue #435?

Mr1994 commented 9 months ago

It works now. I switched to a different server afterwards, hehe, and everything is fine. It was indeed a problem with the txt file. Thanks, boss!

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 8 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.