ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models
Apache License 2.0
7.01k stars · 571 forks

Is pre-training unsupported on Mac M-series chips? #493

Closed ban-shi-yi-sheng closed 5 months ago

ban-shi-yi-sheng commented 6 months ago

Checklist before submitting

Issue type

Model training and fine-tuning

Base model

Chinese-Alpaca-2-16K (7B/13B)

Operating system

macOS

Detailed description of the problem

Is pre-training unsupported on Mac M-series chips?

Dependency information (required for code-related issues)

# Please paste your dependency information here (inside this code block)

Run log or screenshot

./run_pt.sh
NOTE: Redirects are currently not supported in Windows or MacOs.
/opt/homebrew/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Traceback (most recent call last):
  File "/Users/xxxx/AI/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
    main()
  File "/Users/xxxx/AI/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 375, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 129, in __init__
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/training_args.py", line 1442, in __post_init__
    and (self.device.type != "cuda")
         ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/training_args.py", line 1887, in device
    return self._setup_devices
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
             ^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/training_args.py", line 1813, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/accelerate/state.py", line 167, in __init__
    assert (
AssertionError: DeepSpeed is not available => install it using `pip3 install deepspeed` or build it from source
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25443) of binary: /opt/homebrew/opt/python@3.11/bin/python3.11
Traceback (most recent call last):
  File "/opt/homebrew/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-04_20:02:16
  host      : wanrendembp.lan
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 25443)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
ymcui commented 6 months ago

You'll have to look into this yourself. Two options:

  1. llama.cpp's finetune example: https://github.com/ggerganov/llama.cpp/tree/master/examples/finetune
  2. mlx: https://github.com/ml-explore/mlx
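For readers hitting the same wall: the `AssertionError` in the log above is raised by `accelerate`, apparently because the launch script requests DeepSpeed, and DeepSpeed does not install on macOS. A minimal, purely illustrative sketch (this helper is *not* part of the Chinese-LLaMA-Alpaca-2 scripts) of how one might pick a workable backend per machine:

```python
import importlib.util
import platform


def suggest_backend():
    """Heuristic sketch: suggest a training backend for the current machine.

    Illustrative only -- names and the decision logic are assumptions,
    not part of this repository.
    """
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    if not on_apple_silicon:
        # On a typical Linux/CUDA box, the stock DeepSpeed launch path works.
        return "deepspeed"
    # DeepSpeed does not build on macOS, which is what triggers the
    # AssertionError seen in the log; fall back to Apple-friendly options.
    if importlib.util.find_spec("mlx") is not None:
        return "mlx"  # ml-explore/mlx, native to Apple silicon
    return "llama.cpp"  # CPU/Metal finetune example in llama.cpp
```

Calling `suggest_backend()` on an M-series Mac would return `"mlx"` or `"llama.cpp"`, matching the two suggestions above; on other hosts it returns `"deepspeed"`.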
github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 5 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

ban-shi-yi-sheng commented 5 months ago

You'll have to look into this yourself. Two options:

  1. llama.cpp's finetune example: https://github.com/ggerganov/llama.cpp/tree/master/examples/finetune
  2. mlx: https://github.com/ml-explore/mlx