ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

Error when fine-tuning Alpaca-Plus-7B #754

Closed TCHSDUFH closed 1 year ago

TCHSDUFH commented 1 year ago

Items that must be checked before submitting

Issue type

Model training and fine-tuning

Base model

Alpaca-Plus-7B

Operating system

Linux

Detailed description of the problem

# Please paste the command/code you ran here (delete this block if not applicable)

Dependencies (must be provided for code-related issues)

# Please paste your dependency information here

Run log or screenshots

[INFO|trainer.py:1777] 2023-07-17 15:08:20,504 >> ***** Running training *****
[INFO|trainer.py:1778] 2023-07-17 15:08:20,504 >>   Num examples = 7
[INFO|trainer.py:1779] 2023-07-17 15:08:20,504 >>   Num Epochs = 15
[INFO|trainer.py:1780] 2023-07-17 15:08:20,504 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1781] 2023-07-17 15:08:20,504 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1782] 2023-07-17 15:08:20,504 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1783] 2023-07-17 15:08:20,504 >>   Total optimization steps = 100
[INFO|trainer.py:1784] 2023-07-17 15:08:20,508 >>   Number of trainable parameters = 159,907,840
  0%|                                                                                                                                               | 0/100 [00:00<?, ?it/s][WARNING|logging.py:295] 2023-07-17 15:08:20,528 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/environment/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
  File "run_clm_sft_with_peft.py", line 461, in <module>
    main()
  File "run_clm_sft_with_peft.py", line 423, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2761, in training_step
    self.accelerator.backward(loss)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1900, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
  0%|                                                                                                                                               | 0/100 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22649) of binary: /environment/miniconda3/bin/python
Traceback (most recent call last):
  File "/environment/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-17_15:08:24
  host      : featurize
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22649)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Searching online, I found suggestions that this is a multi-GPU parallelism issue, but I only have a single GPU. After a lot of tinkering I still cannot resolve this error.
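
Note on the error itself: the RuntimeError "element 0 of tensors does not require grad and does not have a grad_fn", together with the earlier warning "None of the inputs have requires_grad=True", typically shows up when gradient checkpointing is enabled while the base model is fully frozen (as in LoRA/PEFT fine-tuning), so the activations entering the checkpointed layers carry no gradient information. Below is a minimal sketch of a commonly suggested workaround; it assumes a Hugging Face transformers model with a PEFT LoRA adapter, uses a hypothetical model path, and is not necessarily the fix the issue author applied.

# Minimal sketch (assumptions: transformers PreTrainedModel + PEFT LoRA;
# "path/to/chinese-alpaca-plus-7b" is a hypothetical model path).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/chinese-alpaca-plus-7b")
model.gradient_checkpointing_enable()

# With the base weights frozen, the embedding outputs do not require grad, so the
# recomputed graph has no grad_fn and backward() fails. Force the inputs of the
# checkpointed blocks to require grad:
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
else:
    # Fallback for older transformers versions: hook the input embeddings.
    def make_inputs_require_grad(module, inputs, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

lora_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(model, lora_config)

Disabling gradient checkpointing, or aligning the peft/transformers versions with those listed in the repository's requirements, are other options often mentioned for this error.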

TCHSDUFH commented 1 year ago

Resolved.

1986smalltiger commented 1 year ago

Could you share how you resolved it?