ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models
Apache License 2.0

Is this error during training caused by a problem with my txt file? #438

Closed · Mr1994 closed this issue 8 months ago

Mr1994 commented 9 months ago

Pre-submission checklist completed

Issue type

Model training and fine-tuning

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

bash scripts/training/run_pt.sh
[2023-12-05 11:53:17,052] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-05 11:53:19,238] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-05 11:53:19,238] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
12/05/2023 11:55:30 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:715] 2023-12-05 11:55:30,062 >> loading configuration file /llma2/llama.cpp/models/chinese-alpaca-2-7b-hf/config.json
[INFO|configuration_utils.py:777] 2023-12-05 11:55:30,063 >> Model config LlamaConfig {
  "_name_or_path": "/llma2/llama.cpp/models/chinese-alpaca-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.2",
  "use_cache": true,
  "vocab_size": 55296
}

[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,063 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2020] 2023-12-05 11:55:30,064 >> loading file tokenizer.json
[WARNING|logging.py:329] 2023-12-05 11:55:30,064 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/llma2/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
    main()
  File "/llma2/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 549, in main
    lm_datasets = lm_datasets.train_test_split(test_size = data_args.validation_split_percentage)
AttributeError: 'list' object has no attribute 'train_test_split'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39701) of binary: /llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/bin/python
Traceback (most recent call last):
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/llma2/Chinese-LLaMA-Alpaca-2/Chinese-LLaMA-Alpaca-2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/training/run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-05_11:55:33
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 39701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
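
For context: `train_test_split` is a method of `datasets.Dataset`, not of a plain Python list. The pretraining script builds one dataset per input txt file and concatenates them; if no file is processed successfully (an empty data directory, unreadable files, or a stale tokenization cache), `lm_datasets` can remain a plain list and fail exactly as above. Below is a minimal sketch of the expected flow, assuming the Hugging Face `datasets` library; the file names are hypothetical, not from the project:

```python
# Minimal sketch (not the project script): how train_test_split is
# normally reached. File names here are hypothetical.
from datasets import load_dataset, concatenate_datasets

# Each plain-text file becomes a Dataset with a single "text" column.
parts = [
    load_dataset("text", data_files=f)["train"]
    for f in ["corpus_a.txt", "corpus_b.txt"]  # hypothetical inputs
]

lm_datasets = concatenate_datasets(parts)  # a datasets.Dataset, not a list

# This only works on a Dataset; calling it on a plain Python list raises
# the AttributeError seen in the log above.
lm_datasets = lm_datasets.train_test_split(test_size=0.05)
print(lm_datasets["train"], lm_datasets["test"])
```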

### Dependencies (must be provided for code-related issues)

Please paste your dependency information here (paste it inside this code block)


### Run logs or screenshots

(screenshot attached)

iMountTai commented 9 months ago

What format is your txt file? From the error, all I can see is that the `lm_datasets` you generated has the wrong type; it is not clear whether the problem is a bad generated cache or bad input data.
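
One way to narrow this down is to load a single txt file the same way the script does and inspect the result. This is a hypothetical debugging snippet with a placeholder path, not part of the project:

```python
from datasets import load_dataset

# Hypothetical path: point this at one of your training txt files.
raw = load_dataset("text", data_files="/path/to/train_data.txt")["train"]

print(type(raw))     # expect <class 'datasets.arrow_dataset.Dataset'>
print(raw.num_rows)  # 0 rows would also break downstream processing
if raw.num_rows:
    print(raw[0]["text"][:80])  # peek at the first line
```

If the file loads cleanly, clearing the dataset cache directory used by the script and re-running is a reasonable next step, since a cache left behind by an earlier failed run can also produce the wrong type.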

iMountTai commented 9 months ago

Didn't you already get the code running in issue #435?

Mr1994 commented 9 months ago

It works now. I switched to a different server afterwards, hehe, and everything is fine. It was indeed a problem with the txt file. Thanks, boss!

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 8 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.