yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

With LoRA turned off, fine-tuning the large model with model parallelism fails: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! #64

Open · huangcaiyun opened this issue 1 year ago

huangcaiyun commented 1 year ago

I wonder whether the author has tried training with model parallelism directly, without LoRA. Could you help take a look? I've been searching for a long time and still can't pinpoint the problem.

huangcaiyun commented 1 year ago
[screenshot of the error traceback]
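
For context, a minimal sketch of the error class reported here: with naive model parallelism, different layers live on different GPUs, and any op between tensors left on different devices fails with exactly this message. This is a generic illustration (it assumes a machine with at least two GPUs), not the repo's actual training code.

import torch

# Hypothetical reproduction: two tensors on different devices.
a = torch.randn(4, 4, device="cuda:0")
b = torch.randn(4, 4, device="cuda:1")

try:
    _ = a @ b  # RuntimeError: Expected all tensors to be on the same device...
except RuntimeError as e:
    print(e)

_ = a @ b.to(a.device)  # the usual fix: move one operand explicitly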
yuanzhoulvpi2017 commented 1 year ago

Yes, I've run into this problem too; I haven't fixed this bug yet 😂

huangcaiyun commented 1 year ago

Do you have a rough idea yet of which part the bug is in? Also, a small suggestion: would you consider using Pipelining Inputs to speed up model-parallel training?

yuanzhoulvpi2017 commented 1 year ago

I haven't found the exact cause yet; I don't know what's going on.

janglichao commented 1 year ago

I've hit this bug too. How do you solve it? Waiting online for a fix.

Tianranse commented 1 year ago

I hit a similar problem at inference time. Does anyone have a solution? RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_mm)
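
For this cuda:0-vs-cpu variant, the usual cause is that the model weights were moved to the GPU while the tokenized inputs stayed on the CPU. A hedged sketch of the explicit fix, using a generic Hugging Face model as a stand-in (the model name is a placeholder, not the one from this thread):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder model
model = AutoModel.from_pretrained("bert-base-chinese").to("cuda:0")

inputs = tokenizer("你好", return_tensors="pt")  # these tensors start on the CPU
# Without this transfer, the first matmul sees cuda:0 weights and cpu inputs,
# which is exactly the wrapper_mm error quoted above.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)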

ray-008 commented 1 year ago

I got the same error during inference. After importing torch, I set the default tensor type and the error went away:

import torch

# Make newly created tensors default to CUDA float tensors,
# so intermediates no longer land on the CPU.
torch.set_default_tensor_type('torch.cuda.FloatTensor')
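
This works because every tensor created afterwards defaults to a CUDA float tensor, so nothing lands on the CPU by accident; note it changes global behavior rather than fixing the one misplaced tensor. On newer PyTorch releases (2.0+), where set_default_tensor_type is deprecated, a similar effect is available via the following (a sketch, assuming a recent PyTorch):

import torch

torch.set_default_device("cuda")  # new tensors default to the current CUDA device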