ymcui / Chinese-LLaMA-Alpaca-3

Chinese LLaMA-Alpaca phase-3 project (Chinese Llama-3 LLMs) developed from Meta Llama 3
Apache License 2.0

Multi-GPU training fails with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure #89

Closed cc8476 closed 1 month ago

cc8476 commented 2 months ago

The following items must be checked before submission

Issue type

Model training and fine-tuning

Base model

Llama-3-Chinese-8B (base model)

Operating system

Linux

Detailed description of the problem

Running the multi-GPU training script
torchrun --nnodes 1 --nproc_per_node 2 run_clm_pt_with_peft.py
fails with the error below (with nproc_per_node set to 1 it does not, and pre-training completes normally).
I have spent a whole day searching online, mostly going through the usual checks; everything looks fine, and I still cannot find the cause...
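
As a debugging sketch (the environment-variable combination below is an assumption, not something from the original report), re-running the same command with synchronous CUDA error reporting and NCCL debug logging often replaces the generic "unspecified launch failure" with the actual faulting operation:

```bash
# Sketch of a diagnostic re-run: same launch command, with synchronous CUDA
# error reporting and NCCL debug logging enabled.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
torchrun --nnodes 1 --nproc_per_node 2 run_clm_pt_with_peft.py

# If the NCCL log implicates peer-to-peer transfers between the two GPUs,
# disabling P2P is a common probe to narrow things down (a workaround, not a fix):
NCCL_P2P_DISABLE=1 torchrun --nnodes 1 --nproc_per_node 2 run_clm_pt_with_peft.py
```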

Here is my environment:
Hardware:
H800 *8
Software:
torch.__version__
2.3.1
torch.version.cuda
12.1

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

In addition, NCCL and nvidia-fabricmanager are both installed and running normally.
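
A minimal NCCL smoke test, as a sketch (the file name smoke_test.py and its contents are assumptions for illustration, not from this report): if a bare two-rank all-reduce launched with the same torchrun flags dies with the same c10::Error, the fault is in the driver/NCCL/P2P stack rather than in run_clm_pt_with_peft.py; if it passes, the training script or its memory footprint is the more likely culprit.

```bash
# Sketch: write a tiny NCCL all-reduce test (hypothetical name: smoke_test.py)
# and launch it exactly like the failing training run.
cat > smoke_test.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank/world size via env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                         # every rank should end up with the world size
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()
EOF

torchrun --nnodes 1 --nproc_per_node 2 smoke_test.py
```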

Dependencies (must be provided for code-related issues)

# Paste your dependency information here (inside this code block)

Run logs or screenshots



[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-15 18:39:43,339 >> loading file tokenizer_config.json
[WARNING|logging.py:313] 2024-07-15 18:39:43,604 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
training datasets-wikipedia-cn-20230720-filtered has been loaded from disk
Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-2fbc8bad28044f1f.arrow
07/15/2024 18:39:43 - INFO - datasets.arrow_dataset - Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-2fbc8bad28044f1f.arrow
Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-940cea3e270b30f4.arrow
07/15/2024 18:39:43 - INFO - datasets.arrow_dataset - Caching indices mapping at /llm/trans_recorder/7_15_2/data_cache/wikipedia-cn-20230720-filtered_1024/train/cache-940cea3e270b30f4.arrow
07/15/2024 18:39:44 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:313] 2024-07-15 18:39:44,945 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2496f78897 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2496f28b25 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f249732f718 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1db46 (0x7f24972fab46 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7f24972fc5e3 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7f24972fc922 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7f2495faf950 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7f2496f5d36f in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f2496f561cb in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2496f56379 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0xe5d280 (0x7f244879d280 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x57692d2 (0x7f248e7cf2d2 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5773d00 (0x7f248e7d9d00 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5773e05 (0x7f248e7d9e05 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4db0e26 (0x7f248de16e26 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x175be98 (0x7f248a7c1e98 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x577e1b4 (0x7f248e7e41b4 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x577ef65 (0x7f248e7e4f65 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd21ca8 (0x7f249672bca8 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47def4 (0x7f2495e87ef4 in /root/miniconda3/envs/myenv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #20: /root/miniconda3/envs/myenv/bin/python() [0x4fd4c7]
frame #21: _PyObject_MakeTpCall + 0x25b (0x4f6c5b in /root/miniconda3/envs/myenv/bin/python)
frame #22: /root/miniconda3/envs/myenv/bin/python() [0x5093cf]
frame #23: _PyEval_EvalFrameDefault + 0x13b3 (0x4eecf3 in /root/miniconda3/envs/myenv/bin/python)
frame #24: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x2b79 (0x4f04b9 in /root/miniconda3/envs/myenv/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x4b26 (0x4f2466 in /root/miniconda3/envs/myenv/bin/python)
frame #28: /root/miniconda3/envs/myenv/bin/python() [0x5717c7]
frame #29: /root/miniconda3/envs/myenv/bin/python() [0x4fdaf4]
frame #30: _PyEval_EvalFrameDefault + 0x31f (0x4edc5f in /root/miniconda3/envs/myenv/bin/python)
frame #31: /root/miniconda3/envs/myenv/bin/python() [0x509367]
frame #32: _PyEval_EvalFrameDefault + 0x2818 (0x4f0158 in /root/miniconda3/envs/myenv/bin/python)
frame #33: _PyFunction_Vectorcall + 0x6f (0x4fd90f in /root/miniconda3/envs/myenv/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x31f (0x4edc5f in /root/miniconda3/envs/myenv/bin/python)
frame #35: /root/miniconda3/envs/myenv/bin/python() [0x595062]
frame #36: PyEval_EvalCode + 0x87 (0x594fa7 in /root/miniconda3/envs/myenv/bin/python)
frame #37: /root/miniconda3/envs/myenv/bin/python() [0x5c5e17]
frame #38: /root/miniconda3/envs/myenv/bin/python() [0x5c0f60]
frame #39: /root/miniconda3/envs/myenv/bin/python() [0x4595b6]
frame #40: _PyRun_SimpleFileObject + 0x19f (0x5bb4ef in /root/miniconda3/envs/myenv/bin/python)
frame #41: _PyRun_AnyFileObject + 0x43 (0x5bb253 in /root/miniconda3/envs/myenv/bin/python)
frame #42: Py_RunMain + 0x38d (0x5b800d in /root/miniconda3/envs/myenv/bin/python)
frame #43: Py_BytesMain + 0x39 (0x588299 in /root/miniconda3/envs/myenv/bin/python)
frame #44: <unknown function> + 0x29d90 (0x7f24b6591d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #45: __libc_start_main + 0x80 (0x7f24b6591e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: /root/miniconda3/envs/myenv/bin/python() [0x58814e]

W0715 18:39:49.066000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGTERM
W0715 18:40:19.066000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:868] Unable to shutdown process 13549 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
^CW0715 18:41:12.757000 139889817904960 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGINT death signal, shutting down workers
W0715 18:41:12.757000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGINT
^CW0715 18:41:12.948000 139889817904960 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 13549 closing signal SIGTERM
github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 month ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.