ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn #847

Closed. phamkhactu closed this issue 10 months ago.

phamkhactu commented 10 months ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

LLaMA-7B

Operating System

Linux

Describe your issue in detail

Thanks for your work.

I run the command:
. run_pt.sh 

But I get an error when running it. I searched for the error, and all the replies say it has already been fixed; I reinstalled with another version, but it still does not work.

I merged new tokens into the tokenizer. Could that cause this error, and do I need to set use_cache=False?
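
For reference, explicitly disabling the KV cache when enabling gradient checkpointing is typically done as below. This is only a minimal sketch (the model path is hypothetical, and run_clm_pt_with_peft.py may already handle this internally); the use_cache warning in the log is informational, since transformers switches the flag off automatically.

    from transformers import AutoModelForCausalLM

    # Minimal sketch, not taken from run_clm_pt_with_peft.py.
    model = AutoModelForCausalLM.from_pretrained("path/to/chinese-llama-7b")  # hypothetical path
    model.config.use_cache = False          # KV cache is incompatible with gradient checkpointing
    model.gradient_checkpointing_enable()   # recompute activations to save memory during training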

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

[WARNING|logging.py:305] 2023-09-24 16:30:30,071 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[WARNING|logging.py:305] 2023-09-24 16:30:30,086 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
  File "run_clm_pt_with_peft.py", line 642, in <module>
    main()
  File "run_clm_pt_with_peft.py", line 610, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 1536, in train
    return inner_training_loop(
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/accelerate/accelerator.py", line 1838, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1958, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
  File "run_clm_pt_with_peft.py", line 642, in <module>
    main()
  File "run_clm_pt_with_peft.py", line 610, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 1536, in train
    return inner_training_loop(
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/accelerate/accelerator.py", line 1838, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1958, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/tupk/tupk/nlp/Chinese-LLaMA-Alpaca/scripts/training/wandb/offline-run-20230924_163000-ydl05elp
wandb: Find logs at: ./wandb/offline-run-20230924_163000-ydl05elp/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3137437) of binary: /home/tupk/anaconda3/envs/nlp/bin/python
Traceback (most recent call last):
  File "/home/tupk/anaconda3/envs/nlp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-09-24_16:30:35
  host      : ai-gpu-server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3137438)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-24_16:30:35
  host      : ai-gpu-server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3137437)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
phamkhactu commented 10 months ago

Hi @airaria. After searching, I found a fix. I don't have permission to create a merge request for my case, but after line 559 you can add the following lines of code; it works well for me.

    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    else:
        def make_inputs_require_grad(module, input, output):
            output.requires_grad_(True)

Thank you.

xiaoxin83121 commented 9 months ago

https://github.com/huggingface/transformers/issues/23170#issuecomment-1536455122 The code above is missing one line; see the details in this issue.
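
For reference, a sketch of the complete pattern discussed in the linked transformers issue; the final forward-hook registration is the line that was missing above:

    if hasattr(model, "enable_input_require_grads"):
        # Newer transformers versions expose a helper that does this directly.
        model.enable_input_require_grads()
    else:
        def make_inputs_require_grad(module, input, output):
            # Make the embedding output require grad so the checkpointed blocks
            # produce a grad_fn even when the base model weights are frozen.
            output.requires_grad_(True)

        # This registration is the missing line.
        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

With PEFT/LoRA plus gradient checkpointing, the frozen embedding outputs do not require gradients, so the loss ends up with no grad_fn; forcing the embedding output to require gradients (or calling enable_input_require_grads) is what resolves the RuntimeError.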