xiaozhi-agent / pylmkit

PyLMKit: helps users quickly build practical large language model applications
https://www.yuque.com/txhy/pylmkit
Apache License 2.0
36 stars 4 forks

Error when running in a multi-GPU environment #5

Open liuhongjie001 opened 6 months ago

liuhongjie001 commented 6 months ago

Running Qwen-1_8B-chat in a multi-GPU environment raises RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! How can this be fixed?

xiaozhi-agent commented 6 months ago

Please post the code you are calling and the full error message.

liuhongjie001 commented 6 months ago

Environment: AI Studio, 2 GPUs (160 GB). Code:

from pylmkit.llms import LocalLLMModel

model = LocalLLMModel(model_path='/home/aiuser/.conda/envs/llm-test1/model/Qwen-1_8B-chat',  # path to the model files saved earlier
                      tokenizer_kwargs={"revision": 'master'},
                      model_kwargs={"revision": 'master'},
                      language='zh'
                      )

# 普通模式 (normal mode)
res = model.invoke(query="如何学习python?")
print(">>>invoke ", res)

Output:

9:19:49 ~/.conda/envs/llm-test1 $ /home/aiuser/.conda/envs/llm_test/bin/python /home/aiuser/.conda/envs/llm-test1/.vscode/llmDemo.py
2024-04-11 09:20:14,375 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-04-11 09:20:14,376 - modelscope - INFO - Loading ast index from /home/aiuser/.cache/modelscope/ast_indexer
2024-04-11 09:20:14,472 - modelscope - INFO - No valid ast index found from /home/aiuser/.cache/modelscope/ast_indexer, generating ast index from prebuilt!
2024-04-11 09:20:14,537 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 db09a39b3781812ba6a34416a85b6dff and a total number of 972 components indexed
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency flash-attention/csrc/rotary at main · Dao-AILab/flash-attention
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency flash-attention/csrc/layer_norm at main · Dao-AILab/flash-attention
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████| 2/2 [00:01<00:00, 1.63it/s]
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/home/aiuser/.conda/envs/llm-test1/.vscode/llmDemo.py", line 11, in <module>
    res = model.invoke(query="如何学习python?")
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/pylmkit/llms/_huggingface_llm.py", line 48, in invoke
    response, self.history = self.model.chat(self.tokenizer, query, history, **kwargs)
  File "/home/aiuser/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-chat/modeling_qwen.py", line 1137, in chat
    outputs = self.generate(
  File "/home/aiuser/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-chat/modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/transformers/generation/utils.py", line 1525, in generate
    return self.sample(
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/transformers/generation/utils.py", line 2622, in sample
    outputs = self(
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/aiuser/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-chat/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/aiuser/.conda/envs/llm_test/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/aiuser/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-chat/modeling_qwen.py", line 1363, in forward
    return output * self.weight
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
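The warning "You shouldn't move a model that is dispatched using accelerate hooks." earlier in the log suggests the checkpoint was loaded with an accelerate device_map and its layers were split across cuda:0 and cuda:1, which is what later triggers the device mismatch in the RMSNorm step (output * self.weight). A quick check of where the layers ended up (this assumes the underlying transformers model is exposed as model.model, as the traceback suggests):

# hf_device_map is set by transformers/accelerate only when a device_map was used at load time
print(getattr(model.model, "hf_device_map", "model was loaded onto a single device"))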
xiaozhi-agent commented 6 months ago

Try this:

import torch
from pylmkit.llms import LocalLLMModel

Local = LocalLLMModel(model_path='/home/aiuser/.conda/envs/llm-test1/model/Qwen-1_8B-chat',  # path to the model files saved earlier
                      tokenizer_kwargs={"revision": 'master'},
                      model_kwargs={"revision": 'master'},
                      language='zh'
                      )
model = Local.model.to('cuda:0')
Local.model = torch.nn.DataParallel(model, device_ids=[0, 1])  # replicate the model across both GPUs
# normal mode
res = Local.invoke(query="如何学习python?")
print(">>>invoke ", res)
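If the DataParallel approach above still hits the device error (or if the wrapper hides the model's custom chat method), another workaround worth trying is to restrict the process to a single GPU before the model is loaded, so accelerate cannot split the checkpoint across cuda:0 and cuda:1. This is only a sketch and gives up multi-GPU sharding; the path and query are copied from your example:

import os
# Must be set before torch initializes CUDA, so only GPU 0 is visible
# and every layer of the checkpoint ends up on the same device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from pylmkit.llms import LocalLLMModel

model = LocalLLMModel(model_path='/home/aiuser/.conda/envs/llm-test1/model/Qwen-1_8B-chat',
                      tokenizer_kwargs={"revision": 'master'},
                      model_kwargs={"revision": 'master'},
                      language='zh'
                      )
res = model.invoke(query="如何学习python?")
print(">>>invoke ", res)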
xiaozhi-agent commented 6 months ago

Did it run successfully?