yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)

ValueError during training: The current `device_map` had weights offloaded to the disk. #153

Open SKY-ZW opened 1 year ago

SKY-ZW commented 1 year ago

    Traceback (most recent call last):
      File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 470, in <module>
        main()
      File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 133, in main
        model = AutoModel.from_pretrained(
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
        return model_class.from_pretrained(
      File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
        ) = cls._load_pretrained_model(
      File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2980, in _load_pretrained_model
        raise ValueError(
    ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.
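For reference, the error message itself points at the usual workaround: once `device_map` sends some weights to disk, `from_pretrained` needs an `offload_folder`. A minimal sketch of such a call (the exact code at main.py line 133 is not shown in this thread, so the arguments below are illustrative; the local model path follows the one mentioned later in the thread):

```python
# Hedged sketch, not the repo's exact code: load the model and give from_pretrained
# a folder for the weights that device_map pushes to disk.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",  # local path assumed from this thread
    trust_remote_code=True,            # ChatGLM2 ships its own modeling code
    device_map="auto",                 # assumption: a device_map is already in use, as the error implies
    offload_folder="offload_folder",   # where the disk-offloaded weights are written
)
```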

yuanzhoulvpi2017 commented 1 year ago
  1. Did you modify the code?
  2. Is your transformers version the latest? It's best to update it to the latest release.
SKY-ZW commented 1 year ago

I haven't modified the code.
Name: transformers
Version: 4.30.2

yuanzhoulvpi2017 commented 1 year ago

"The current device_map had weights offloaded to the disk" means the device placement wasn't set up properly; some of the weights were probably assigned to disk (i.e. offloaded to the hard drive). Check your device setup.
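One quick check in the spirit of this advice: confirm that the Colab runtime actually exposes a GPU, since without one `device_map` has nothing but CPU RAM and disk to place weights on. A minimal sketch using standard torch calls:

```python
# Hedged sketch: verify that a GPU is visible to PyTorch in the Colab runtime.
# If this prints False, accelerate can only map weights to CPU RAM and disk.
import torch

print(torch.cuda.is_available())                  # expect True on a GPU runtime
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))          # e.g. "Tesla T4" on free Colab
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")
```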

SKY-ZW commented 1 year ago

I'm using Colab. Do I need a local GPU?

yuanzhoulvpi2017 commented 1 year ago

In principle it should run in Colab too, but this error is strange.

My guess is a version problem with the transformers, accelerate, or peft packages. Try updating all of them to the latest versions.

If it still fails after that, I don't know either.

SKY-ZW commented 1 year ago

In Colab I ran pip install --upgrade on all of the packages below; the versions are listed here, but I still get the same error. Do I need to pin specific versions?

Name: transformers
Version: 4.30.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
---
Name: accelerate
Version: 0.22.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, packaging, psutil, pyyaml, torch
Required-by: peft
---
Name: peft
Version: 0.5.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: sourab@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by:
yuanzhoulvpi2017 commented 1 year ago

I can't figure it out either~

SKY-ZW commented 1 year ago

I only changed the --model_name_or_path /content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b path in train.sh; I didn't touch anything else.

yuanzhoulvpi2017 commented 1 year ago

That shouldn't affect it. I'm not sure what's going on.

SKY-ZW commented 1 year ago

After adding offload_folder = "offload_folder" at line 133, the error above is gone, but now a different problem shows up: Loading checkpoint shards: 71% 5/7 [01:03<00:23, 11.53s/it] train.sh: line 24: 5399 Killed

SKY-ZW commented 1 year ago

It looks like it used too much memory and got killed by Colab.
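The "Killed" at 71% of the 7 checkpoint shards is consistent with host RAM running out while the shards are materialized. A hedged sketch of a lower-memory load using half precision and `low_cpu_mem_usage` (both standard `from_pretrained` arguments; whether this then fits the free Colab tier is not guaranteed):

```python
# Hedged sketch, not the repo's exact call: keep host RAM usage down while the
# checkpoint shards are loaded, which is the point where Colab killed the process.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",
    trust_remote_code=True,
    torch_dtype=torch.float16,        # load weights in fp16 instead of fp32
    low_cpu_mem_usage=True,           # stream shards instead of building a full copy in RAM first
    device_map="auto",                # place layers on the GPU where they fit
    offload_folder="offload_folder",  # destination for anything that still has to go to disk
)
```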