yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

My model files were downloaded from the dddd version and placed under the Chatglm6b_ModelParallel folder. I only changed the CUDA configuration, but training still hits an error. #89

Open · Rorschaaaach opened this issue 1 year ago

Rorschaaaach commented 1 year ago

```
Traceback (most recent call last):
  File "train_model_all.py", line 320, in <module>
    trainer.train()
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 1600, in train
    return inner_training_loop(
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 1867, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 2601, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 2634, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 1071, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 901, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 897, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 571, in forward
    attention_input = self.input_layernorm(hidden_states)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__native_layer_norm)
```
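In this trace, the layer-norm weight lives on cuda:1 while the incoming hidden states are still on cuda:0, which is what happens when a model is split across GPUs but activations are never handed over between stages. A minimal sketch of the mismatch and the usual fix, assuming an illustrative two-layer split rather than the repo's actual device map:

```python
import torch
import torch.nn as nn

# Illustrative two-GPU split -- layer names and the split point are assumptions.
block_on_gpu0 = nn.Linear(8, 8).to("cuda:0")
norm_on_gpu1 = nn.LayerNorm(8).to("cuda:1")

x = torch.randn(2, 8, device="cuda:0")
h = block_on_gpu0(x)        # activations come out on cuda:0
# norm_on_gpu1(h)           # RuntimeError: ... cuda:0 and cuda:1! (as in the trace)
h = h.to("cuda:1")          # hand the activations over to the next stage's device
out = norm_on_gpu1(h)       # weight and input now share a device
```

As the reply below notes, the real cause here was model files from the wrong version, so the forward pass never performed these hand-offs where the parallel split expected them.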

yuanzhoulvpi2017 commented 1 year ago

That definitely won't work. The dddd version is not meant for Chatglm6b_ModelParallel.

Rorschaaaach commented 1 year ago

...You're right. After I switched to the model in Chatglm6b_ModelParallel it runs, but isn't this the version that doesn't really learn anything? Is the new multi-GPU LoRA version about to be released?

yuanzhoulvpi2017 commented 1 year ago

I've already finished the new multi-GPU LoRA version, but I can't be bothered to release it right now 😂. ChatGLM is iterating too fast.

Rorschaaaach commented 1 year ago

Haha, I know that feeling all too well. Code that still ran on Friday gets an upstream update over the weekend and won't run on Monday.

Rorschaaaach commented 1 year ago

My training set has over 16,000 samples and my test set over 4,000, but the progress bar only shows two thousand and something 😥. What's going on here? [screenshot: Screenshot_2023-04-20-21-40-55-661_com.oray.sunlogin.jpg]
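Most likely the progress bar is counting optimizer steps, not individual samples: the HF Trainer's total is roughly len(train_dataset) divided by the effective batch size, times the number of epochs. A quick sanity check, where every setting below is an assumed value for illustration, not read from the repo:

```python
import math

# All settings here are assumptions for illustration -- plug in your own.
num_train_samples = 16000        # roughly the training-set size reported above
per_device_train_batch_size = 8
world_size = 1                   # model parallelism runs one process; DDP would use n_gpus
gradient_accumulation_steps = 1
num_train_epochs = 1

effective_batch = per_device_train_batch_size * world_size * gradient_accumulation_steps
total_steps = math.ceil(num_train_samples / effective_batch) * num_train_epochs
print(total_steps)  # 2000 -> a progress bar "in the two thousands" is expected
```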

juemifuji commented 1 year ago

> I've already finished the new multi-GPU LoRA version, but I can't be bothered to release it right now 😂. ChatGLM is iterating too fast.

Come on, don't say that! We're still waiting for your latest code.

Ardang666 commented 1 year ago

Please push the multi-GPU LoRA update! Right now I'm finding that multi-GPU and single-GPU LoRA training take the same amount of time.

HawkL327 commented 1 year ago

Just launch with DeepSpeed's DDP from the command line and add deepspeed="path/to/deepspeed config" in the trainer arguments. I've tested it myself with no problems, and overall training time does come down.
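For anyone wiring this up, a minimal sketch of what that might look like; the ZeRO stage, batch size, and launch command below are assumptions, not a tested config from this repo:

```python
# Hypothetical minimal DeepSpeed + Trainer wiring -- all values are placeholders.
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},   # ZeRO-2: shard optimizer state across GPUs
    "fp16": {"enabled": True},
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    fp16=True,
    deepspeed=ds_config,  # a dict or a path to a JSON config both work here
)

# Then launch one process per GPU so data parallelism actually kicks in:
#   deepspeed train_model_all.py
```

Unlike the model-parallel setup discussed above, DDP gives each GPU its own slice of every batch, which is why this shortens wall-clock time where splitting one model across GPUs does not.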