yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)

ValueError during training: The current `device_map` had weights offloaded to the disk. #153

Open SKY-ZW opened 1 year ago

SKY-ZW commented 1 year ago

    Traceback (most recent call last):
      File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 470, in <module>
        main()
      File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 133, in main
        model = AutoModel.from_pretrained(
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
        return model_class.from_pretrained(
      File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
        ) = cls._load_pretrained_model(
      File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2980, in _load_pretrained_model
        raise ValueError(
    ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.
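For reference, the error message itself points at the usual workaround: once `device_map` sends some weights to disk, `from_pretrained` needs an `offload_folder`. A minimal sketch of such a call (the exact code at main.py line 133 is not shown in this thread, so the arguments below are illustrative; the local model path follows the one mentioned later in the thread):

```python
# Hedged sketch, not the repo's exact code: load the model and give from_pretrained
# a folder for the weights that device_map pushes to disk.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",  # local path assumed from this thread
    trust_remote_code=True,            # ChatGLM2 ships its own modeling code
    device_map="auto",                 # assumption: a device_map is already in use, as the error implies
    offload_folder="offload_folder",   # where the disk-offloaded weights are written
)
```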

yuanzhoulvpi2017 commented 1 year ago
  1. Did you modify the code?
  2. Is your transformers version the latest? It's best to update it to the latest release.
SKY-ZW commented 1 year ago

I haven't modified the code.
Name: transformers
Version: 4.30.2

yuanzhoulvpi2017 commented 1 year ago

"The current device_map had weights offloaded to the disk" means the device placement wasn't set up properly; some of the weights were probably assigned to disk (i.e. offloaded to the hard drive). Check your device setup.
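One quick check in the spirit of this advice: confirm that the Colab runtime actually exposes a GPU, since without one `device_map` has nothing but CPU RAM and disk to place weights on. A minimal sketch using standard torch calls:

```python
# Hedged sketch: verify that a GPU is visible to PyTorch in the Colab runtime.
# If this prints False, accelerate can only map weights to CPU RAM and disk.
import torch

print(torch.cuda.is_available())                  # expect True on a GPU runtime
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))          # e.g. "Tesla T4" on free Colab
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")
```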

SKY-ZW commented 1 year ago

I'm using Colab. Do I need a local GPU?

yuanzhoulvpi2017 commented 1 year ago

In principle it should run in Colab too, but this error is strange.

My guess is a version problem with the transformers, accelerate, or peft packages. Try updating all of them to the latest versions.

If it still fails after that, I don't know either.

SKY-ZW commented 1 year ago

In Colab I ran pip install --upgrade on all of the packages below; the versions are listed here, but I still get the same error. Do I need to pin specific versions?

Name: transformers
Version: 4.30.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
---
Name: accelerate
Version: 0.22.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, packaging, psutil, pyyaml, torch
Required-by: peft
---
Name: peft
Version: 0.5.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: sourab@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by:
yuanzhoulvpi2017 commented 1 year ago

I can't figure it out either~

SKY-ZW commented 1 year ago

I only changed the --model_name_or_path /content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b path in train.sh; I didn't touch anything else.

yuanzhoulvpi2017 commented 1 year ago

That shouldn't affect it. I'm not sure what's going on.

SKY-ZW commented 1 year ago

After adding offload_folder = "offload_folder" at line 133, the error above is gone, but now a different problem shows up: Loading checkpoint shards: 71% 5/7 [01:03<00:23, 11.53s/it] train.sh: line 24: 5399 Killed

SKY-ZW commented 1 year ago

It looks like it used too much memory and got killed by Colab.
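The "Killed" at 71% of the 7 checkpoint shards is consistent with host RAM running out while the shards are materialized. A hedged sketch of a lower-memory load using half precision and `low_cpu_mem_usage` (both standard `from_pretrained` arguments; whether this then fits the free Colab tier is not guaranteed):

```python
# Hedged sketch, not the repo's exact call: keep host RAM usage down while the
# checkpoint shards are loaded, which is the point where Colab killed the process.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",
    trust_remote_code=True,
    torch_dtype=torch.float16,        # load weights in fp16 instead of fp32
    low_cpu_mem_usage=True,           # stream shards instead of building a full copy in RAM first
    device_map="auto",                # place layers on the GPU where they fit
    offload_folder="offload_folder",  # destination for anything that still has to go to disk
)
```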