ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0
18.23k stars 1.86k forks

Progress bar hangs during model SFT training, with no error reported #710

Closed Qmymy closed 1 year ago

Qmymy commented 1 year ago

Items that must be checked before submitting

Issue type

Model inference

Base model

Alpaca-7B

Operating system

Linux

Detailed description of the problem

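During model SFT, the progress bar stops moving and no error is reported. This has happened many times. What is causing it, and is there a fix? I tried lowering the --preprocessing_num_workers parameter, but training still hangs partway through.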
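A hang with no error message gives little to go on. One general way to see where a Python process is stuck is the standard-library `faulthandler` module, which can dump every thread's traceback after a timeout. The sketch below is not taken from the project's training script: `heartbeat`, `stop_watchdog`, and the 300-second threshold are hypothetical names and values chosen for illustration.

```python
import faulthandler
import sys
import time

# Assumed threshold: how long one training step may take before we
# consider the process stalled. Tune this to the real step time.
STALL_TIMEOUT_S = 300.0


def heartbeat(timeout_s=STALL_TIMEOUT_S, file=sys.stderr):
    """(Re)arm the watchdog.

    If heartbeat() is not called again within `timeout_s` seconds, the
    tracebacks of all threads are written to `file`, and again after
    every further `timeout_s` seconds. A new call replaces the previous
    timer, so calling this once per step resets the countdown.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def stop_watchdog():
    """Cancel the pending dump, for example after training finishes."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Toy loop standing in for training steps.
    for step in range(3):
        heartbeat(timeout_s=5.0)
        time.sleep(0.1)  # pretend to do one step of work
    stop_watchdog()
```

Calling `heartbeat()` once per training step means a dump appears only when a step takes longer than the threshold. How to hook it into the training loop depends on the script and is not shown here. The traceback shows which Python call each thread is blocked in, which helps separate a data-loading stall from a call that never returns from the GPU driver.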

Dependencies (must be provided for code-related issues)

accelerate 0.20.3
aiohttp 3.8.4
aiosignal 1.3.1
asttokens 2.0.5
async-timeout 4.0.2
attrs 22.2.0
backcall 0.2.0
blinker 1.4
certifi 2023.5.7
charset-normalizer 3.1.0
cmake 3.26.4
command-not-found 0.3
cryptography 3.4.8
datasets 2.13.1
dbus-python 1.2.18
deepspeed 0.9.5
dill 0.3.6
distro 1.7.0
distro-info 1.1build1
exceptiongroup 1.1.1
filelock 3.12.2
frozenlist 1.3.3
fsspec 2023.6.0
hjson 3.1.0
httplib2 0.20.2
huggingface-hub 0.15.1
idna 3.4
importlib-metadata 4.6.4
iniconfig 2.0.0
jeepney 0.7.1
Jinja2 3.1.2
joblib 1.2.0
keyring 23.5.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lit 16.0.6
MarkupSafe 2.1.3
more-itertools 8.10.0
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
netifaces 0.11.0
networkx 3.1
ninja 1.11.1
numpy 1.25.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.0
packaging 23.1
pandas 2.0.2
peft 0.3.0.dev0
Pillow 9.3.0
pip 22.0.2
pluggy 1.2.0
psutil 5.9.5
py-cpuinfo 9.0.0
pyarrow 12.0.1
pydantic 1.10.9
PyGObject 3.42.1
PyJWT 2.3.0
pyparsing 2.4.7
pytest 7.4.0
python-apt 2.4.0+ubuntu1
python-dateutil 2.8.2
pytz 2023.3
PyYAML 5.4.1
regex 2023.6.3
requests 2.31.0
safetensors 0.3.1
scikit-learn 1.2.2
scipy 1.11.0
SecretStorage 3.3.1
sentencepiece 0.1.97
setuptools 59.6.0
six 1.16.0
sympy 1.12
systemd-python 234
threadpoolctl 3.1.0
tokenizers 0.13.3
tomli 2.0.1
torch 2.0.1
torchaudio 2.0.2+cu117
torchvision 0.15.2+cu117
tqdm 4.65.0
transformers 4.28.1
triton 2.0.0
typing_extensions 4.6.3
tzdata 2023.3
ubuntu-advantage-tools 8001
ufw 0.36.1
unattended-upgrades 0.1
urllib3 2.0.3
wadllib 1.3.6
wheel 0.37.1
xxhash 3.2.0
yarl 1.9.2
zipp 1.0.0

Run logs or screenshots

{'loss': 1.4134, 'learning_rate': 1.1583333333333333e-05, 'epoch': 0.01}
{'loss': 1.8403, 'learning_rate': 1.175e-05, 'epoch': 0.01}
{'loss': 1.7563, 'learning_rate': 1.1916666666666667e-05, 'epoch': 0.01}
{'loss': 1.5062, 'learning_rate': 1.2083333333333333e-05, 'epoch': 0.01}
{'loss': 1.6713, 'learning_rate': 1.225e-05, 'epoch': 0.01}
{'loss': 1.7966, 'learning_rate': 1.2416666666666667e-05, 'epoch': 0.01}
{'loss': 1.6115, 'learning_rate': 1.2583333333333334e-05, 'epoch': 0.01}
{'loss': 1.7219, 'learning_rate': 1.2750000000000002e-05, 'epoch': 0.01}
0%|▎ | 779/400000 [04:44<39:47:44, 2.79it/s]
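To tell a real stall from a slow step, it can help to read the step counter out of the log tail at two points in time and compare. The following is a minimal sketch that assumes the log looks like the excerpt above (metric dictionaries followed by a tqdm-style progress line). `parse_metrics`, `parse_progress`, and `is_stalled` are hypothetical helper names, not part of the training script.

```python
import ast
import re

# Three of the metric lines and the progress line quoted in the post.
LOG_TAIL = (
    "{'loss': 1.7966, 'learning_rate': 1.2416666666666667e-05, 'epoch': 0.01} "
    "{'loss': 1.6115, 'learning_rate': 1.2583333333333334e-05, 'epoch': 0.01} "
    "{'loss': 1.7219, 'learning_rate': 1.2750000000000002e-05, 'epoch': 0.01} "
    "0%|▎ | 779/400000 [04:44<39:47:44, 2.79it/s]"
)

# Matches "step/total [elapsed<remaining, rate it/s]" as printed above.
PROGRESS_RE = re.compile(r"(\d+)/(\d+) \[([\d:]+)<([\d:]+),\s*([\d.]+)it/s\]")


def clock_to_seconds(clock):
    """Convert 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_metrics(text):
    """Return every {'loss': ..., ...} dictionary found in the text."""
    return [ast.literal_eval(m) for m in re.findall(r"\{[^{}]*\}", text)]


def parse_progress(text):
    """Return the last progress reading in the text, or None if absent."""
    matches = PROGRESS_RE.findall(text)
    if not matches:
        return None
    step, total, elapsed, remaining, rate = matches[-1]
    return {
        "step": int(step),
        "total": int(total),
        "elapsed_s": clock_to_seconds(elapsed),
        "eta_s": clock_to_seconds(remaining),
        "rate": float(rate),
    }


def is_stalled(earlier, later):
    """True if the step counter did not advance between two readings."""
    return later["step"] <= earlier["step"]
```

Taking one reading, waiting for several multiples of the usual step time, and taking another gives a simple check: if `is_stalled` returns `True`, the job has stopped making progress rather than merely slowed down.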

Qmymy commented 1 year ago

GPU and CPU memory are both sufficient; that is not the problem.

Qmymy commented 1 year ago

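It was an integrated-graphics driver problem. Partway through training, the operating system handed the task to the integrated GPU, which caused the stall. If anyone runs into the same problem, try disabling the integrated GPU.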