Rorschaaaach opened 1 year ago
This definitely isn't right; the dddd version isn't meant to be used for Chatglm6b_ModelParallel.
...True. I switched to the model inside Chatglm6b_ModelParallel and it runs now, but doesn't that version fail to actually learn anything? Is the new multi-GPU LoRA version about to be released?...
I've already finished the new multi-GPU LoRA version, but I can't be bothered to release it right now 😂; ChatGLM iterates too quickly.
Hahaha, I know the feeling. Code that still ran on Friday gets hit by an update over the weekend and won't run by Monday.
My training set has over 16,000 samples and my test set over 4,000, but the progress bar only shows 2,000-something 😥. What's going on, boss?
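(For reference: the Hugging Face Trainer progress bar counts optimizer steps, not individual samples, so the total shrinks with batch size, gradient accumulation, and the number of GPUs. A minimal back-of-envelope sketch, with every number below assumed purely for illustration:)

```python
import math

# All values are assumed for illustration; substitute your own training settings.
num_samples = 16000          # training examples
per_device_batch_size = 4    # per-GPU batch size
num_gpus = 2                 # data-parallel workers
grad_accum_steps = 1         # gradient accumulation steps
num_epochs = 1

effective_batch = per_device_batch_size * num_gpus * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 2000 steps for 16000 samples with an effective batch of 8
```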
> I've already finished the new multi-GPU LoRA version, but I can't be bothered to release it right now 😂; ChatGLM iterates too quickly.
No, don't do that, boss. We're still waiting for your latest code.
Boss, please release the multi-GPU LoRA update. Right now multi-GPU and single-GPU LoRA take the same training time.
Just launch DeepSpeed's DDP directly from the command line and add deepspeed="path to/deepspeed config" to the trainer arguments. Tested it myself, no problems, and the overall training time does come down.
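(A minimal sketch of that setup, with the file names, batch sizes, and ZeRO stage all assumed for illustration; the actual config and script names in this repo may differ:)

```python
# Launch command (assumed script name):
#   deepspeed --num_gpus=2 train_model_all.py
#
# ds_config.json (assumed minimal ZeRO-2 config, letting HF fill in "auto" values):
#   {
#     "train_micro_batch_size_per_gpu": "auto",
#     "gradient_accumulation_steps": "auto",
#     "zero_optimization": {"stage": 2}
#   }

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",             # assumed output path
    per_device_train_batch_size=4,   # assumed batch size
    num_train_epochs=1,
    fp16=True,
    deepspeed="ds_config.json",      # path to your DeepSpeed config
)

# Then build the Trainer as usual and call trainer.train();
# the DeepSpeed launcher handles the DDP process group for you.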
Traceback (most recent call last):
  File "train_model_all.py", line 320, in <module>
    trainer.train()
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 1600, in train
    return inner_training_loop(
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 1867, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 2601, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/lirui/Chatglm6b_ModelParallel/MyTrainer.py", line 2634, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 1071, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 901, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 897, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/thuglm/modeling_chatglm.py", line 571, in forward
    attention_input = self.input_layernorm(hidden_states)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/opt/conda/envs/ChatGLM/lib/python3.8/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__native_layer_norm)
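(The usual cause of this error when a model's layers are split across GPUs is that the activations are never moved to the device of the next layer, so a LayerNorm on cuda:1 receives a cuda:0 tensor. The Chatglm6b_ModelParallel model handles these moves inside its forward pass; below is only a minimal illustration of the pattern, with every class and variable name assumed rather than taken from the repo:)

```python
import torch.nn as nn

# Minimal sketch, assuming a hypothetical split of a layer stack across two GPUs.
# Each half lives on its own device, and the activation is moved to match it.
class NaiveTwoGPUModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        half = len(layers) // 2
        self.first = nn.ModuleList(layers[:half]).to("cuda:0")
        self.second = nn.ModuleList(layers[half:]).to("cuda:1")

    def forward(self, hidden_states):
        hidden_states = hidden_states.to("cuda:0")
        for layer in self.first:
            hidden_states = layer(hidden_states)
        # Without this .to() call, the first LayerNorm on cuda:1 sees a cuda:0 tensor
        # and raises exactly the "Expected all tensors to be on the same device" error.
        hidden_states = hidden_states.to("cuda:1")
        for layer in self.second:
            hidden_states = layer(hidden_states)
        return hidden_states
```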