yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

Single-machine multi-GPU training of chat_glm is broken #86

Open cxj01 opened 1 year ago

cxj01 commented 1 year ago

I am using the repository code. Although the machine has two GPUs, the model still only loads onto one of them, and if I manually assign layers to different GPUs, I get an error saying the tensors are not all on the same device.

yuanzhoulvpi2017 commented 1 year ago

Check the model code; make sure you are using the code I provide.

cxj01 commented 1 year ago

@yuanzhoulvpi2017 I am using exactly the code from this repository; the only change is that I put the last two layers on the other GPU. [screenshots of the device assignment attached]

yuanzhoulvpi2017 commented 1 year ago

layers.27, final_layernorm and lm_head must be on the same GPU. Change that.
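For reference, here is a minimal sketch of such a split, assuming the public THUDM/chatglm-6b module names and accelerate's `dispatch_model`; it is not the repository's own loading code. It keeps word_embeddings, layers.27, final_layernorm and lm_head together on cuda:0 and spreads the remaining blocks across the two cards.

```python
# Minimal sketch, assuming THUDM/chatglm-6b module names; not this repo's loading code.
from transformers import AutoModel
from accelerate import dispatch_model

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.final_layernorm": 0,  # must share a card with layers.27
    "lm_head": 0,                      # and with lm_head
}
for i in range(28):
    # keep the last block (layers.27) on cuda:0, split the rest over both cards
    device_map[f"transformer.layers.{i}"] = 0 if (i < 14 or i == 27) else 1

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()
model = dispatch_model(model, device_map=device_map)
```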

YSLLYW commented 1 year ago

> layers.27, final_layernorm and lm_head must be on the same GPU. Change that.

I followed this repository's code exactly, but I get the same error as above: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

YSLLYW commented 1 year ago

The sample still does not run.

Ardang666 commented 1 year ago

Locate the layer named in the error and move its input onto the same device as that layer's weights with input.to(...); then it runs.
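For illustration, a minimal sketch of that workaround, assuming the `model` variable from the loading sketch above; this is not the exact code used in the thread. A forward pre-hook moves each parameterized sub-module's positional tensor inputs onto that module's own device, which also covers final_layernorm from the error message.

```python
import torch

def _move_inputs_to_module_device(module, args):
    # Move positional tensor inputs onto the device of this module's weights.
    device = next(module.parameters()).device
    return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

# Hook every sub-module that owns parameters (Linear, LayerNorm, ...).
for sub_module in model.modules():
    if any(True for _ in sub_module.parameters(recurse=False)):
        sub_module.register_forward_pre_hook(_move_inputs_to_module_device)
```

Note that this only handles positional arguments; keyword tensor arguments would need `register_forward_pre_hook(..., with_kwargs=True)` on a recent PyTorch.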