yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

With LoRA turned off, fine-tuning the large model with model parallelism fails: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! #64

Open · huangcaiyun opened this issue 1 year ago

huangcaiyun commented 1 year ago

I wonder whether the author has tried training with model parallelism directly, without LoRA. Could you help take a look? I've been searching for a long time and still can't pinpoint the problem.

huangcaiyun commented 1 year ago
[screenshot of the error traceback]
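
For context, a minimal sketch of the error class reported here: with naive model parallelism, different layers live on different GPUs, and any op between tensors left on different devices fails with exactly this message. This is a generic illustration (it assumes a machine with at least two GPUs), not the repo's actual training code.

import torch

# Hypothetical reproduction: two tensors on different devices.
a = torch.randn(4, 4, device="cuda:0")
b = torch.randn(4, 4, device="cuda:1")

try:
    _ = a @ b  # RuntimeError: Expected all tensors to be on the same device...
except RuntimeError as e:
    print(e)

_ = a @ b.to(a.device)  # the usual fix: move one operand explicitly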
yuanzhoulvpi2017 commented 1 year ago

Yes, I've run into this problem too; I haven't fixed this bug yet 😂

huangcaiyun commented 1 year ago

Do you have a rough idea yet of which part the bug is in? Also, a small suggestion: would you consider using Pipelining Inputs to speed up model-parallel training?

yuanzhoulvpi2017 commented 1 year ago

I haven't found the exact cause yet; I don't know what's going on.

janglichao commented 1 year ago

I've hit this bug too. How do you solve it? Waiting online for a fix.

Tianranse commented 1 year ago

I hit a similar problem at inference time. Does anyone have a solution? RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_mm)
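
For this cuda:0-vs-cpu variant, the usual cause is that the model weights were moved to the GPU while the tokenized inputs stayed on the CPU. A hedged sketch of the explicit fix, using a generic Hugging Face model as a stand-in (the model name is a placeholder, not the one from this thread):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder model
model = AutoModel.from_pretrained("bert-base-chinese").to("cuda:0")

inputs = tokenizer("你好", return_tensors="pt")  # these tensors start on the CPU
# Without this transfer, the first matmul sees cuda:0 weights and cpu inputs,
# which is exactly the wrapper_mm error quoted above.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)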

ray-008 commented 1 year ago

I got the same error during inference. After importing torch, I set the default tensor type and the error went away:

import torch

# Make newly created tensors default to CUDA float tensors,
# so intermediates no longer land on the CPU.
torch.set_default_tensor_type('torch.cuda.FloatTensor')
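
This works because every tensor created afterwards defaults to a CUDA float tensor, so nothing lands on the CPU by accident; note it changes global behavior rather than fixing the one misplaced tensor. On newer PyTorch releases (2.0+), where set_default_tensor_type is deprecated, a similar effect is available via the following (a sketch, assuming a recent PyTorch):

import torch

torch.set_default_device("cuda")  # new tensors default to the current CUDA device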