ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

Chinese-Alpaca-Plus-7B pre-training error #750

Closed mazhai closed 11 months ago

mazhai commented 11 months ago

Required checks before submitting

Issue type

Model training and fine-tuning

Base model

LLaMA-7B

Operating system

Linux

Detailed description of the problem

run_pt.sh

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

# HF-format base model
pretrained_model=/home/mazhai/source/open_source/gtp/llama/7B_hf/
# tokenizer (taken from the Chinese-Alpaca LoRA directory)
chinese_tokenizer_path=/home/mazhai/source/open_source/gtp/llama/chinese_alpaca_plus_lora_7b/
# training dataset
dataset_dir=/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/data
# directory for the data cache generated during training
data_cache=temp_data_cache_dir

per_device_train_batch_size=1
per_device_eval_batch_size=1
training_steps=100
gradient_accumulation_steps=1
# where to save model checkpoints
output_dir=/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/alpaca_output

deepspeed_config_file=ds_zero2_no_offload.json

# nnodes: a single machine; nproc_per_node: launch one worker process; run_clm_pt_with_peft.py: the script to run
torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed 666 \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --ddp_find_unused_parameters False
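
For reference, here is a minimal pre-flight check (a sketch, not part of the repo's scripts) that compares the base model's vocabulary size with the tokenizer's before launching torchrun; the paths are the ones used in run_pt.sh above:

from transformers import AutoConfig, LlamaTokenizer

# Paths taken from run_pt.sh above
pretrained_model = "/home/mazhai/source/open_source/gtp/llama/7B_hf/"
chinese_tokenizer_path = "/home/mazhai/source/open_source/gtp/llama/chinese_alpaca_plus_lora_7b/"

config = AutoConfig.from_pretrained(pretrained_model)
tokenizer = LlamaTokenizer.from_pretrained(chinese_tokenizer_path)

# Original LLaMA-7B has 32000 tokens; the Chinese-Alpaca tokenizer has 49954,
# so this pairing can be caught before any training starts.
print("base model vocab_size :", config.vocab_size)
print("tokenizer vocab size  :", len(tokenizer))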

adapter_config.json from chinese_alpaca_plus_lora_7b:

{
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "lora_alpha": 128,
  "lora_dropout": 0.05,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "target_modules": [
    "q_proj",
    "v_proj",
    "k_proj",
    "o_proj",
    "gate_proj",
    "down_proj",
    "up_proj"
  ],
  "task_type": "CAUSAL_LM"
}
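
A small sketch (not from the repo) of how the same adapter config could be inspected with peft 0.3.0, assuming the directory layout from run_pt.sh:

from peft import LoraConfig

# LoRA directory from run_pt.sh above; it contains the adapter_config.json shown here
adapter_path = "/home/mazhai/source/open_source/gtp/llama/chinese_alpaca_plus_lora_7b/"

cfg = LoraConfig.from_pretrained(adapter_path)
print(cfg.peft_type, cfg.task_type)   # expect LORA / CAUSAL_LM
print(cfg.r, cfg.lora_alpha)          # 64 / 128 (note: run_pt.sh itself sets lora_rank=8, lora_alpha=32)
print(cfg.target_modules)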

Dependencies (must be provided for code-related issues)

peft 0.3.0
torch 2.0.1+cu118
torchaudio 2.0.2+cu118
torchvision 0.15.2+cu118
transformers 4.30.0

OS: Ubuntu 22.04.2 LTS

Run logs or screenshots

[2023-07-16 21:03:11,613] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-16 21:03:11,918] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-16 21:03:11,918] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-16 21:03:11,918] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
07/16/2023 21:03:12 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:667] 2023-07-16 21:03:12,173 >> loading configuration file /home/mazhai/source/open_source/gtp/llama/7B_hf/config.json
[INFO|configuration_utils.py:725] 2023-07-16 21:03:12,175 >> Model config LlamaConfig {
  "_name_or_path": "/home/mazhai/source/open_source/gtp/llama/7B_hf/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.30.0",
  "use_cache": true,
  "vocab_size": 32000
}

[INFO|tokenization_utils_base.py:1821] 2023-07-16 21:03:12,175 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1821] 2023-07-16 21:03:12,176 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1821] 2023-07-16 21:03:12,176 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1821] 2023-07-16 21:03:12,176 >> loading file tokenizer_config.json
07/16/2023 21:03:13 - INFO - datasets.builder - Using custom data configuration default-464cd55ef1707a8e
07/16/2023 21:03:13 - INFO - datasets.info - Loading Dataset Infos from /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/datasets/packaged_modules/text
07/16/2023 21:03:13 - INFO - datasets.builder - Generating dataset text (/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/temp_data_cache_dir/pt_sample_data_text/text/default-464cd55ef1707a8e/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
Downloading and preparing dataset text/default to /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/temp_data_cache_dir/pt_sample_data_text/text/default-464cd55ef1707a8e/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 21620.12it/s]
07/16/2023 21:03:13 - INFO - datasets.download.download_manager - Downloading took 0.0 min
07/16/2023 21:03:13 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 3443.60it/s]
07/16/2023 21:03:13 - INFO - datasets.builder - Generating train split
07/16/2023 21:03:13 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset text downloaded and prepared to /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/temp_data_cache_dir/pt_sample_data_text/text/default-464cd55ef1707a8e/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 964.87it/s]
07/16/2023 21:03:13 - INFO - __main__ - pt_sample_data.txt has been loaded
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #0 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00000_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #1 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00001_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #2 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00002_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #3 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00003_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #4 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00004_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #5 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00005_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #6 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00006_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Process #7 will write at temp_data_cache_dir/pt_sample_data_text/tokenized_00007_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Spawning 8 processes
Running tokenizer on dataset (num_proc=8):   0%|          | 0/125987 [00:00<?, ? examples/s]
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00002_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00005_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00001_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00006_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00007_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00000_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00004_of_00008.arrow
07/16/2023 21:03:13 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/tokenized_00003_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Concatenating 8 shards                                                                                                                                                                                                                                           
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #0 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00000_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #1 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00001_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #2 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00002_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #3 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00003_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #4 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00004_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #5 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00005_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #6 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00006_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Process #7 will write at temp_data_cache_dir/pt_sample_data_text/grouped_00007_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Spawning 8 processes
Grouping texts in chunks of 512 (num_proc=8):   0%|          | 0/125987 [00:00<?, ? examples/s]
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00002_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00003_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00001_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00000_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00004_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00007_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00005_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching processed dataset at temp_data_cache_dir/pt_sample_data_text/grouped_00006_of_00008.arrow
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Concatenating 8 shards                                                                                                                                                                                                                                           
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching indices mapping at /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/temp_data_cache_dir/pt_sample_data_text/cache-12dfd35c4cc675e3.arrow                                                                                        
07/16/2023 21:03:14 - INFO - datasets.arrow_dataset - Caching indices mapping at /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/temp_data_cache_dir/pt_sample_data_text/cache-eedb0e844116c028.arrow
07/16/2023 21:03:14 - INFO - __main__ - Num train_samples  6906
07/16/2023 21:03:14 - INFO - __main__ - training example:
07/16/2023 21:03:14 - INFO - __main__ -  可以节省大量的费用。<s> - 考虑购买经常使用的大宗商品,同时检查以确保你得到最好的价格。<s> - 不要害怕要求折扣或者谈判价格。<s> - 寻找隐藏的折扣或者促销,比如买一送一的促销。<s> 阐述为什么使用计算机完成大学作业是有益的。使用计算机完成大学作业有诸多好处。计算机使学生能够从教授保持联系,使学生合作以产生更高质量的作业。计算机还使学生能够随时随地访问讲座和课程材料。数据库管理和演示软件可用于轻松地创建专业的文档、图像、演示文稿和网站。最后但同样重要的是,计算机通过减轻冗长任务的负担(例如记笔记或计算)帮助节省时间和精力。<s> 写一篇说服顾客重新使用你的产品的劝说信息。尊敬的顾客,<s>更容易使用我们的产品。我们的客户服务团队将非常乐意为您提供个性化指导。<s><s> 我们知道市场上有很多选择,我们很感激您一直以来对我们的信任。如果您有任何问题,请随时与我们联系。我们期待您的回归。<s><s> 此致<s> 敬礼,<s> [贵公司名称]<s> 您正在帮助客户挑选礼物,他们想要一些新奇的东西。列出一些新奇的礼物想法。以下是一些新奇的礼物:<s> - 带有隐藏信息的定制拼图:制作一个有隐藏信息或照片的拼图。<s> - 定制摇头娃娃:制作一只定制的摇头娃娃,可以根据他们最喜欢的爱好、兴趣或事业加以定制。<s> - 定制游戏椅:设计一款带有该人物最喜欢的视频游戏角色和功能的游戏椅。<s>- 3D 打印
[INFO|modeling_utils.py:2575] 2023-07-16 21:03:14,923 >> loading weights file /home/mazhai/source/open_source/gtp/llama/7B_hf/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1173] 2023-07-16 21:03:14,923 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:577] 2023-07-16 21:03:14,923 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.30.0"
}

Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.77s/it]
[INFO|modeling_utils.py:3295] 2023-07-16 21:03:18,561 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:3303] 2023-07-16 21:03:18,561 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /home/mazhai/source/open_source/gtp/llama/7B_hf/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:537] 2023-07-16 21:03:18,563 >> loading configuration file /home/mazhai/source/open_source/gtp/llama/7B_hf/generation_config.json
[INFO|configuration_utils.py:577] 2023-07-16 21:03:18,563 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.30.0"
}

Traceback (most recent call last):
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py", line 645, in <module>
    main()
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py", line 555, in main
    raise ValueError(
ValueError: The combination of base model (size: 32000) and tokenizer (size: 49954) is not a valid configuration. Please check our project wiki for further information. 
Valid configurations (base model / tokenizer):
- Continue pre-training original LLaMA: 32000 / 32000 
- Pre-training Chinese LLaMA based on original LLaMA: 32000 / 49953 
- Continue pre-training Chinese LLaMA: 49953 / 49953 
- Continue pre-training Chinese Alpaca: 49954 / 49954 

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 32985) of binary: /home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/bin/python
Traceback (most recent call last):
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazhai/source/open_source/gtp/Chinese-LLaMA-Alpaca/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-16_21:03:20
  host      : mazhai-pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 32985)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
airaria commented 11 months ago

The error message already says it:

ValueError: The combination of base model (size: 32000) and tokenizer (size: 49954) is not a valid configuration. Please check our project wiki for further information. 
Valid configurations (base model / tokenizer):
- Continue pre-training original LLaMA: 32000 / 32000 
- Pre-training Chinese LLaMA based on original LLaMA: 32000 / 49953 
- Continue pre-training Chinese LLaMA: 49953 / 49953 
- Continue pre-training Chinese Alpaca: 49954 / 49954 

Please switch to the chinese-llama tokenizer instead.
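
For reference, a sketch of the vocab-size check behind the ValueError above; it mirrors the table printed in the error message, not the exact code in run_clm_pt_with_peft.py:

# Valid (base model vocab, tokenizer vocab) pairings, as listed in the error message
VALID_COMBINATIONS = {
    (32000, 32000): "Continue pre-training original LLaMA",
    (32000, 49953): "Pre-training Chinese LLaMA based on original LLaMA",
    (49953, 49953): "Continue pre-training Chinese LLaMA",
    (49954, 49954): "Continue pre-training Chinese Alpaca",
}

def check_vocab_sizes(model_vocab_size: int, tokenizer_vocab_size: int) -> str:
    setup = VALID_COMBINATIONS.get((model_vocab_size, tokenizer_vocab_size))
    if setup is None:
        raise ValueError(
            f"Invalid combination: base model {model_vocab_size} / tokenizer {tokenizer_vocab_size}"
        )
    return setup

# With original LLaMA-7B (32000), the Chinese-LLaMA tokenizer (49953) is the valid choice:
print(check_vocab_sizes(32000, 49953))   # Pre-training Chinese LLaMA based on original LLaMA
# check_vocab_sizes(32000, 49954) raises ValueError -- the situation shown in the log above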

mazhai commented 11 months ago

@airaria Thanks, I was using the wrong tokenizer.