Parallel training fails when training only the large model

safehumeng commented 1 year ago

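Could someone help take a look? If I skip loading the LoRA part of the model and fine-tune the large model directly, running in parallel fails with this error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0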

The exact location is as shown above; the full error is below. The three devices printed just before the traceback (presumably by the debug print at MyTrainer.py:2653) are:

    cuda:0 cuda:0 cuda:0

Traceback (most recent call last):

    /workspace/ChatGLM-Tuning/finetune_all_fp32.py:343 in <module>

        340     train_dataset=tokenized_datasets["train"],
        341     eval_dataset=tokenized_datasets["valid"],
        342 )
      ❱ 343 trainer.train()
        344 trainer.save_model(args.output_dir)
        345

    /workspace/ChatGLM-Tuning/MyTrainer.py:1629 in train

        1626         inner_training_loop = find_executable_batch_size(
        1627             self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
        1628         )
      ❱ 1629         return inner_training_loop(
        1630             args=args,
        1631             resume_from_checkpoint=resume_from_checkpoint,
        1632             trial=trial,

    /workspace/ChatGLM-Tuning/MyTrainer.py:1896 in _inner_training_loop

        1893                     with model.no_sync():
        1894                         tr_loss_step = self.training_step(model, inputs)
        1895                 else:
      ❱ 1896                     tr_loss_step = self.training_step(model, inputs)
        1897
        1898                 tr_loss = tr_loss.to(tr_loss_step.device)
        1899

    /workspace/ChatGLM-Tuning/MyTrainer.py:2654 in training_step

        2651         if self.do_grad_scaling:
        2652             loss = loss.to(self.args.device)
        2653             print(loss.device, self.args.device, inputs["input_ids"].device)
      ❱ 2654             self.scaler.scale(loss).backward()
        2655         elif self.use_apex:
        2656             with amp.scale_loss(loss, self.optimizer) as scaled_loss:
        2657                 scaled_loss.backward()

    /opt/conda/lib/python3.8/site-packages/torch/_tensor.py:487 in backward

        484                 create_graph=create_graph,
        485                 inputs=inputs,
        486             )
      ❱ 487         torch.autograd.backward(
        488             self, gradient, retain_graph, create_graph, inputs=inputs
        489         )
        490

    /opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py:200 in backward

        197     # The reason we repeat same the comment below is that
        198     # some Python versions print out the first line of a multi-line function
        199     # calls in the traceback and some print out the last line
      ❱ 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac
        201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
        202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru
        203

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

cywjava commented 1 year ago

It's a problem with how the network layers are assigned to devices.

safehumeng commented 1 year ago

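> It's a problem with how the network layers are assigned to devices.

But since the error is raised at the loss step, could you tell me how to pinpoint which layer is causing the problem?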
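One way to start narrowing this down is to list which parameters sit on which device before calling `trainer.train()`. The sketch below is only an illustration, not code from this repository: `group_params_by_device` is a hypothetical helper name, and it assumes the input yields `(name, param)` pairs where each `param` has a `.device` attribute, as `model.named_parameters()` does in PyTorch.

```python
from collections import defaultdict


def group_params_by_device(named_params):
    """Map each device (as a string) to the names of the parameters placed on it.

    `named_params` is any iterable of (name, param) pairs whose `param`
    exposes a `.device` attribute. Hypothetical helper for illustration.
    """
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.device)].append(name)
    return dict(groups)


# Assumed usage with a model that has already been split across GPUs:
#
#   placement = group_params_by_device(model.named_parameters())
#   for device, names in placement.items():
#       print(device, len(names), names[:3])
```

Parameters that end up on a different device from the layers they interact with (for example, anything under `transformer.word_embeddings`) would be the first candidates to inspect.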

safehumeng commented 1 year ago

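Adding LoRA and then marking all parameters as trainable also raises the error. How do I control which GPU is used during gradient backpropagation?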

safehumeng commented 1 year ago

If none of the layers under transformer.word_embeddings are trained, the error does not occur.

chen-xinyu commented 1 year ago

I ran into this too. With LoRA it works fine, but once the LoRA part is removed this error appears. Is there any fix?

safehumeng commented 1 year ago

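> I ran into this too. With LoRA it works fine, but once the LoRA part is removed this error appears. Is there any fix?

Freeze transformer.word_embeddings first. With that I can now train 90% of the parameters. I haven't looked into how to train the rest yet; it should just come down to getting the current device during backpropagation and moving the tensors onto it.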
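The freezing workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used in this thread: `freeze_by_prefix` is a hypothetical helper, and it assumes each parameter exposes a writable `requires_grad` flag and a `numel()` method, as PyTorch parameters do.

```python
def freeze_by_prefix(named_params, prefix):
    """Disable gradients for every parameter whose name starts with `prefix`.

    Returns the fraction of parameter elements that remain trainable.
    Hypothetical helper for illustration.
    """
    trainable = 0
    total = 0
    for name, param in named_params:
        if name.startswith(prefix):
            param.requires_grad = False  # excluded from backpropagation
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable / total if total else 0.0


# Assumed usage, before the optimizer or Trainer is created:
#
#   fraction = freeze_by_prefix(model.named_parameters(),
#                               "transformer.word_embeddings")
#   print(f"trainable fraction: {fraction:.1%}")
```

The returned fraction gives a quick check on how much of the model is still being trained after the freeze.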

yuanzhoulvpi2017 commented 1 year ago

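The code for full-parameter fine-tuning is basically finished and is now in the training and debugging stage. I'll release it later, so please hang on a bit.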

yuanzhoulvpi2017 / zero_nlp

Parallel training fails when training only the large model #56