ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase two of the project) with 64K long context models
Apache License 2.0

Stuck at the dataset loading step #537

Closed dehaozhou closed 3 months ago

dehaozhou commented 4 months ago

The following items must be checked before submitting

Issue type

Model training and fine-tuning

Base model

Chinese-Alpaca-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

# Read the wiki(https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/sft_scripts_zh) carefully before running the script
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/home/dehaozhou/LLaMA2/chinese-alpaca-2-7b-rlhf-hf
chinese_tokenizer_path=/home/dehaozhou/LLaMA2/chinese-alpaca-2-7b-rlhf-hf
dataset_dir=/home/dehaozhou/LLaMA2/scripts/training/yixue_data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/home/dehaozhou/LLaMA2/scripts/training/result_v1
validation_file=/home/dehaozhou/LLaMA2/scripts/training/yixue_data/data.json

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 3 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 30 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --load_in_kbits 16 \
    --save_safetensors False \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
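
For reference, the batch-related settings in the script above combine into an effective batch size per optimizer step. A minimal sketch of that arithmetic, assuming the usual data-parallel convention (per-device batch size times number of processes times gradient accumulation steps); the function name is hypothetical and not part of the training script:

```python
def effective_batch_size(per_device_batch_size: int,
                         num_processes: int,
                         gradient_accumulation_steps: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * num_processes * gradient_accumulation_steps


# Values taken from the script above:
# per_device_train_batch_size=1, --nproc_per_node 3, gradient_accumulation_steps=8
print(effective_batch_size(1, 3, 8))
```

With these settings each optimizer step would see 24 examples, which matters when choosing `eval_steps` and `save_steps` for a small dataset.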

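During instruction fine-tuning, the run hangs while loading the dataset and never gets past this point.

Dependencies (must be provided for code-related issues)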

_libgcc_mutex             0.1                        main    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_openmp_mutex             5.1                       1_gnu    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
accelerate                0.27.2                   pypi_0    pypi
aiohttp                   3.9.3                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
annotated-types           0.6.0                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
bitsandbytes              0.41.1                   pypi_0    pypi
ca-certificates           2023.12.12           h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
certifi                   2024.2.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cmake                     3.28.3                   pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
coloredlogs               15.0.1                   pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
datasets                  2.18.0                   pypi_0    pypi
deepspeed                 0.13.5                   pypi_0    pypi
dill                      0.3.7                    pypi_0    pypi
filelock                  3.13.1                   pypi_0    pypi
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2023.10.0                pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
huggingface-hub           0.21.4                   pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
idna                      3.6                      pypi_0    pypi
importlib-metadata        6.8.0                    pypi_0    pypi
jinja2                    3.1.3                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi                    3.4.4                h6a678d5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng                 11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp                   11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-ng              11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit                       17.0.6                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.4.0                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
multidict                 6.0.5                    pypi_0    pypi
multiprocess              0.70.15                  pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx                  3.2.1                    pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi
opencv-python             4.9.0.80                 pypi_0    pypi
openssl                   3.0.13               h7f8727e_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
packaging                 23.2                     pypi_0    pypi
pandas                    2.2.1                    pypi_0    pypi
peft                      0.3.0                    pypi_0    pypi
pip                       23.3.1           py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
pipdeptree                2.16.1                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pyarrow                   15.0.0                   pypi_0    pypi
pyarrow-hotfix            0.6                      pypi_0    pypi
pydantic                  2.6.3                    pypi_0    pypi
pydantic-core             2.16.3                   pypi_0    pypi
pynvml                    11.5.0                   pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
python                    3.9.18               h955ad1f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pywavelets                1.1.1                    pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
requests                  2.31.0                   pypi_0    pypi
safetensors               0.4.2                    pypi_0    pypi
scipy                     1.12.0                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                68.2.2           py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shapely                   2.0.3                    pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sympy                     1.12                     pypi_0    pypi
tabulate                  0.9.0                    pypi_0    pypi
tifffile                  2019.7.26                pypi_0    pypi
tk                        8.6.12               h1ccaba5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers                0.14.1                   pypi_0    pypi
torch                     2.0.1                    pypi_0    pypi
tqdm                      4.66.2                   pypi_0    pypi
transformers              4.35.0                   pypi_0    pypi
triton                    2.0.0                    pypi_0    pypi
typing-extensions         4.10.0                   pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.1                    pypi_0    pypi
wheel                     0.41.2           py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.6                h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
yarl                      1.9.4                    pypi_0    pypi
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

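Note: all packages in requirements have been installed at the specified versions. However, datasets currently requires a newer huggingface-hub while tokenizers requires an older one, and I could not find a version that satisfies both. With the older version the code raises an error; with the newer version the code runs without errors. The dependency constraints here do seem a bit odd.

Run logs or screenshots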
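
One way to see the conflicting constraints described in this note is to print what each installed package declares about huggingface-hub. A minimal sketch using only the standard library's `importlib.metadata`; the helper name `hub_constraints` is hypothetical, and the output depends entirely on the local environment:

```python
from importlib import metadata


def hub_constraints(packages, target="huggingface-hub"):
    """Map each package to the requirement strings it declares for `target`.

    A value of None means the package is not installed in this environment.
    """
    result = {}
    for pkg in packages:
        try:
            requires = metadata.requires(pkg) or []
        except metadata.PackageNotFoundError:
            result[pkg] = None
            continue
        # Requirement names may use "_" or "-", so normalize before matching.
        result[pkg] = [
            req for req in requires
            if req.replace("_", "-").lower().startswith(target)
        ]
    return result


for pkg, reqs in hub_constraints(["datasets", "tokenizers", "transformers"]).items():
    print(pkg, "->", reqs)
```

Comparing the printed ranges with the installed huggingface-hub version shows which package's constraint is being violated.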

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2024-03-07 16:54:19,541] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:54:19,542] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:54:19,544] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:54:22,566] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-07 16:54:22,566] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-07 16:54:22,566] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-07 16:54:22,566] [INFO] [comm.py:637:init_distributed] cdb=None
03/07/2024 16:58:43 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
03/07/2024 16:58:43 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
03/07/2024 16:58:43 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:715] 2024-03-07 16:58:43,909 >> loading configuration file /home/dehaozhou/LLaMA2/chinese-alpaca-2-7b-rlhf-hf/config.json
[INFO|configuration_utils.py:777] 2024-03-07 16:58:43,911 >> Model config LlamaConfig {
  "_name_or_path": "/home/dehaozhou/LLaMA2/chinese-alpaca-2-7b-rlhf-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "end_token_id": 2,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 32000,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.0",
  "use_cache": true,
  "vocab_size": 55296
}
[INFO|tokenization_utils_base.py:2020] 2024-03-07 18:16:47,683 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2020] 2024-03-07 18:16:47,683 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2020] 2024-03-07 18:16:47,683 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2020] 2024-03-07 18:16:47,683 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2020] 2024-03-07 18:16:47,683 >> loading file tokenizer.json
03/07/2024 18:16:47 - INFO - __main__ - Training files: /home/dehaozhou/LLaMA2/scripts/training/yixue_data/data.json
03/07/2024 18:16:47 - WARNING - root - building dataset...
Using custom data configuration default-15935ed09cfc764a
03/07/2024 18:16:48 - INFO - datasets.builder - Using custom data configuration default-15935ed09cfc764a
Loading Dataset Infos from /home/dehaozhou/anaconda3/envs/ll/lib/python3.9/site-packages/datasets/packaged_modules/json
03/07/2024 18:16:48 - INFO - datasets.info - Loading Dataset Infos from /home/dehaozhou/anaconda3/envs/ll/lib/python3.9/site-packages/datasets/packaged_modules/json
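
The timestamps in the log above can be compared directly to see where the time goes. A minimal sketch, with the timestamps copied from the log at second resolution; the event labels are informal descriptions, not names used by the libraries:

```python
from datetime import datetime

# (label, timestamp) pairs copied from the log above.
events = [
    ("DeepSpeed init_distributed", "2024-03-07 16:54:22"),
    ("model config loaded", "2024-03-07 16:58:43"),
    ("tokenizer files loaded", "2024-03-07 18:16:47"),
    ("datasets: Loading Dataset Infos", "2024-03-07 18:16:48"),
]


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    parsed = [(label, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")) for label, ts in events]
    return [
        (prev_label, cur_label, int((cur_time - prev_time).total_seconds()))
        for (prev_label, prev_time), (cur_label, cur_time) in zip(parsed, parsed[1:])
    ]


for start, end, seconds in gaps(events):
    print(f"{start} -> {end}: {seconds} s")
```

By these timestamps, about 78 minutes pass between the model config message and the tokenizer messages, so the slowdown may begin before the dataset step. This is only an observation from the log, not a diagnosis.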

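The code hangs at the last log line (the `Loading Dataset Infos` message) and the dataset never loads. I tried every datasets version from 2.15.0 to 2.18.0 and the result is the same. It also happens with both single-GPU and multi-GPU training. While the script is running, GPU memory usage is only about 360 MB.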
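
When a process hangs without raising an error, a stack dump can show which call it is blocked in. A minimal sketch using the standard library's `faulthandler`; the helper names and the log path are hypothetical, and where to arm the watchdog inside `run_clm_sft_with_peft.py` is left as an assumption:

```python
import faulthandler


def arm_hang_watchdog(log_path, timeout_s=300):
    """Dump all thread stacks to `log_path` every `timeout_s` seconds until disarmed.

    Returns the open log file, which must stay open while the watchdog is armed.
    """
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_hang_watchdog(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Hypothetical usage around the suspected call (one file per process, since
# torchrun starts several ranks):
#   import os
#   log = arm_hang_watchdog(f"hang_trace_{os.getpid()}.log", timeout_s=300)
#   ...  # dataset loading would go here
#   disarm_hang_watchdog(log)
```

If the process is still stuck after the timeout, the log file contains the Python stack of every thread, which narrows down whether the wait is in file locking, network access, or tokenization.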

Dataset format: { "instruction": "易学的起源是什么?", "input": "", "output": "易学的起源可以追溯到远古的人类社会,其起源与发展过程漫长且复杂。" }, { "instruction": "易学的作者是谁?", "input": "", "output": "据记载,伏羲创造了八卦图,这是易学的基础。" },
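
The excerpt above shows records separated by commas, and it is not clear from the post whether the real file wraps them in a JSON array. A minimal sketch of a standalone check, assuming the file is meant to be a single JSON array of records with `instruction`, `input`, and `output` keys; the function name is hypothetical:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def check_sft_records(text):
    """Return a list of problems found in `text`; an empty list means it looks fine."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}, column {exc.colno}"]
    if not isinstance(data, list):
        return ["top-level value is not a list of records"]
    problems = []
    for index, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {index} is not an object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {index} is missing {sorted(missing)}")
    return problems


# Hypothetical usage on the file from the script:
#   with open("/home/dehaozhou/LLaMA2/scripts/training/yixue_data/data.json", encoding="utf-8") as fh:
#       print(check_sft_records(fh.read()))
```

Running this outside the training script separates file-format problems from anything related to distributed startup or caching.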

Could you please help me figure out where the problem might be?

iMountTai commented 4 months ago

Try changing code to False.

dehaozhou commented 4 months ago

Try changing code to False.

Oh, I see what you mean. Thank you, I'll give it a try.

dehaozhou commented 4 months ago

Hello, this does not seem to have solved the problem; it still hangs at the same step: image

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 3 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.