yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

chinese_bloom errors out when training with deepspeed: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm) #123

Open shaoqing404 opened 1 year ago

shaoqing404 commented 1 year ago

Hi, I ran into a problem while trying to train. Could you please take a look? GPU setup: 2× RTX 3090. OS: Ubuntu 20.04. Some solutions I found online may have planted a misleading assumption: they say that because I'm running on two cards, some data may end up on the other card... A single 3090's VRAM is quite small; maybe switching to a card with more VRAM would help?

I would really appreciate your answer.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Error 1

Error message and location:

/File "/home/ash404/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2516, in layer_norm return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

yuanzhoulvpi2017 commented 1 year ago

See this answer: https://github.com/yuanzhoulvpi2017/zero_nlp/issues/118#issuecomment-1574763709
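The content of the linked comment is not reproduced here, but this class of error typically means different layers of the model landed on different GPUs, most often because the model was loaded with device_map="auto" (inference-style model parallelism) while also training data-parallel with DeepSpeed or DDP, which expects each process to hold a full copy of the model on its own device. A minimal sketch of that distinction; the model name is a placeholder, and this is not necessarily the exact fix given in #118:

```python
# Hedged sketch: model name and fix are illustrative, not the exact
# resolution from the linked comment.
import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by the deepspeed/torchrun launcher
torch.cuda.set_device(local_rank)

# Pitfall: device_map="auto" shards layers across cuda:0 and cuda:1;
# mixing that with data-parallel training yields exactly this
# cross-device layer_norm error.
# model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", device_map="auto")

# For DeepSpeed/DDP training, each rank loads the full model onto its own GPU:
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")
model = model.to(f"cuda:{local_rank}")
```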

shaoqing404 commented 1 year ago

See this answer: #118 (comment)

That worked; the earlier error is confirmed gone.

shaoqing404 commented 1 year ago

See this answer: #118 (comment)

But in fact I only have two 3090s, 20 GB of VRAM in total, so I still end up running out of memory and can't verify any further. TT

yuanzhoulvpi2017 commented 1 year ago
1. For models of 3B or smaller, you can actually do full-parameter fine-tuning directly on your two 3090s.
2. If the model you want to train is 7B or larger, then use deepspeed ZeRO-3, or use quantization, LoRA, QLoRA, etc.; see the sketch after this list.
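A minimal sketch of the ZeRO-3 route, not the repo's actual config: the DeepSpeed settings are expressed as a Python dict and handed to the Hugging Face Trainer via TrainingArguments; output path and batch sizes are placeholders.

```python
# Hedged sketch, assuming training goes through the HF Trainer:
# ZeRO-3 shards parameters, gradients, and optimizer states across GPUs,
# and CPU offload trades speed for VRAM on small cards like the 3090.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # push optimizer states to CPU RAM
        "offload_param": {"device": "cpu"},      # push idle parameters to CPU RAM
    },
    "bf16": {"enabled": "auto"},                 # "auto" values are filled from TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed=ds_config,  # Trainer initializes DeepSpeed from this dict
)
```

The script would then be launched with the deepspeed CLI (or torchrun) so that one process is spawned per GPU.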