[INFO|trainer.py:1777] 2023-07-17 15:08:20,504 >> ***** Running training *****
[INFO|trainer.py:1778] 2023-07-17 15:08:20,504 >> Num examples = 7
[INFO|trainer.py:1779] 2023-07-17 15:08:20,504 >> Num Epochs = 15
[INFO|trainer.py:1780] 2023-07-17 15:08:20,504 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1781] 2023-07-17 15:08:20,504 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1782] 2023-07-17 15:08:20,504 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1783] 2023-07-17 15:08:20,504 >> Total optimization steps = 100
[INFO|trainer.py:1784] 2023-07-17 15:08:20,508 >> Number of trainable parameters = 159,907,840
0%| | 0/100 [00:00<?, ?it/s]
[WARNING|logging.py:295] 2023-07-17 15:08:20,528 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/environment/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
File "run_clm_sft_with_peft.py", line 461, in <module>
main()
File "run_clm_sft_with_peft.py", line 423, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/environment/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2761, in training_step
self.accelerator.backward(loss)
File "/home/featurize/work/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/featurize/work/.local/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1900, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/featurize/work/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/environment/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/environment/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
0%| | 0/100 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22649) of binary: /environment/miniconda3/bin/python
Traceback (most recent call last):
File "/environment/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/environment/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-17_15:08:24
host : featurize
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22649)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The following items must be checked before submitting
Issue type
Model training and fine-tuning
Base model
Alpaca-Plus-7B
Operating system
Linux
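Detailed description of the problem
Dependencies (must be provided for code-related issues)
Run logs or screenshots
I searched online and found suggestions that this is a multi-GPU parallelism problem, but I am not using multiple GPUs. I have spent a long time on it and still cannot resolve this error.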