yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

Single-machine multi-GPU training of chat_glm is broken #86

Open cxj01 opened 1 year ago

cxj01 commented 1 year ago

I am using the repository code. Although the machine has two GPUs, the model still only loads onto one of them, and if I manually assign layers to different GPUs, I get an error saying the tensors are not all on the same device.

yuanzhoulvpi2017 commented 1 year ago

Check the model code; make sure you are using the code I provide.

cxj01 commented 1 year ago

@yuanzhoulvpi2017 I am using exactly the code from this repository; the only change is that I put the last two layers on the other GPU. [screenshots of the device assignment attached]

yuanzhoulvpi2017 commented 1 year ago

layers.27, final_layernorm and lm_head must be on the same GPU. Change that.
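For reference, here is a minimal sketch of such a split, assuming the public THUDM/chatglm-6b module names and accelerate's `dispatch_model`; it is not the repository's own loading code. It keeps word_embeddings, layers.27, final_layernorm and lm_head together on cuda:0 and spreads the remaining blocks across the two cards.

```python
# Minimal sketch, assuming THUDM/chatglm-6b module names; not this repo's loading code.
from transformers import AutoModel
from accelerate import dispatch_model

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.final_layernorm": 0,  # must share a card with layers.27
    "lm_head": 0,                      # and with lm_head
}
for i in range(28):
    # keep the last block (layers.27) on cuda:0, split the rest over both cards
    device_map[f"transformer.layers.{i}"] = 0 if (i < 14 or i == 27) else 1

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()
model = dispatch_model(model, device_map=device_map)
```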

YSLLYW commented 1 year ago

> layers.27, final_layernorm and lm_head must be on the same GPU. Change that.

I followed this repository's code exactly, but I get the same error as above: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

YSLLYW commented 1 year ago

The sample still does not run.

Ardang666 commented 1 year ago

Locate the layer named in the error and move its input onto the same device as that layer's weights with input.to(...); then it runs.
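For illustration, a minimal sketch of that workaround, assuming the `model` variable from the loading sketch above; this is not the exact code used in the thread. A forward pre-hook moves each parameterized sub-module's positional tensor inputs onto that module's own device, which also covers final_layernorm from the error message.

```python
import torch

def _move_inputs_to_module_device(module, args):
    # Move positional tensor inputs onto the device of this module's weights.
    device = next(module.parameters()).device
    return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

# Hook every sub-module that owns parameters (Linear, LayerNorm, ...).
for sub_module in model.modules():
    if any(True for _ in sub_module.parameters(recurse=False)):
        sub_module.register_forward_pre_hook(_move_inputs_to_module_device)
```

Note that this only handles positional arguments; keyword tensor arguments would need `register_forward_pre_hook(..., with_kwargs=True)` on a recent PyTorch.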