Closed wangxigui closed 1 year ago
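The 13B model's weights alone already take about 24 GB. Once the LoRA parameters and optimizer states are added, training will not fit in 24 GB. Consider trying ZeRO-3 training or upgrading the hardware.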
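The 24 GB figure can be sanity-checked with simple arithmetic. The sketch below is only an estimate: it assumes a nominal 13e9 parameters (the exact count differs slightly) and counts the weights alone, ignoring gradients, optimizer states, and activations. The helper name `weight_memory_gib` is made up for illustration.

```python
# Rough weight-only memory estimate; ignores gradients, optimizer states,
# activations, and framework overhead.
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 13e9  # nominal "13B"; an assumption, not the exact parameter count

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the weights need roughly 24.2 GiB in 16-bit precision and roughly 12.1 GiB in 8-bit, which is consistent with a 24 GB card being full before training even starts in 16-bit.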
With 8-bit quantization the model now loads normally:
```diff
     torch_dtype = (
         model_args.torch_dtype
@@ -535,6 +533,8 @@ def main():
             revision=model_args.model_revision,
             use_auth_token=True if model_args.use_auth_token else None,
             torch_dtype=torch_dtype,
+            load_in_8bit=True,
+            device_map='auto',
             low_cpu_mem_usage=True
         )
     else:
@@ -558,6 +558,7 @@ def main():
             "- Continue pre-training Chinese Alpaca: 49954 / 49954 \n")
     model.resize_token_embeddings(len(tokenizer))
+
     if training_args.peft_path is not None:
         logger.info("Peft from pre-trained model")
         model = PeftModel.from_pretrained(model, training_args.peft_path)
@@ -581,11 +582,14 @@ def main():
             modules_to_save=modules_to_save)
         model = get_peft_model(model, peft_config)
         model.print_trainable_parameters()
+
     old_state_dict = model.state_dict
     model.state_dict = (
         lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
     ).__get__(model, type(model))
+    model = prepare_model_for_int8_training(model)
```
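For reference, here is a minimal standalone sketch of the same idea outside the training script. It assumes `transformers`, `peft`, and `bitsandbytes` versions from around the time of this thread, where `load_in_8bit=True` is accepted by `from_pretrained`. The function name `build_8bit_lora_model` and the LoRA hyperparameters are placeholders, not values from the repository. PEFT's own int8 examples call the preparation helper before `get_peft_model`, whereas the diff above calls it afterwards; whether that ordering matters for this script is not established in this thread.

```python
def build_8bit_lora_model(model_path: str, target_modules: list):
    """Load a causal LM in 8-bit and attach LoRA adapters (illustrative sketch)."""
    # Imports live inside the function so that merely defining it does not
    # require the GPU libraries to be installed.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    try:
        # Name used by newer PEFT releases
        from peft import prepare_model_for_kbit_training as prepare_fn
    except ImportError:
        # Name used by older PEFT releases, as in the diff above
        from peft import prepare_model_for_int8_training as prepare_fn

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        load_in_8bit=True,      # quantize linear layers via bitsandbytes
        device_map="auto",      # let accelerate place the layers
        low_cpu_mem_usage=True,
    )

    # Prepare the quantized base model first ...
    model = prepare_fn(model)

    # ... then attach the LoRA adapters (hyperparameters are placeholders).
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=target_modules,
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model
```

A call might look like `build_8bit_lora_model("path/to/model", ["q_proj", "v_proj"])`, where both arguments are hypothetical and depend on the checkpoint being used.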
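However, fine-tuning now fails with an error: the optimizer setup raises an out-of-range index (IndexError: list index out of range). Could someone point out what the likely cause is?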
│ /home/ps/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:270 in │
│ __init__ │
│ │
│ 267 │ │ ), f"allgather_bucket_size must be a multiple of nccl_start_alignment_factor, {s │
│ 268 │ │ │
│ 269 │ │ self.all_reduce_print = False │
│ ❱ 270 │ │ **self.dtype = self.optimizer.param_groups[0]['params'][0].dtype** │
│ 271 │ │ │
│ 272 │ │ self.round_robin_bit16_groups = [] │
│ 273 │ │ self.round_robin_bit16_indices = [] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
**IndexError: list index out of range**
![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/6872439/2bfdffde-6481-4eeb-beb5-451048acbe01)
![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/6872439/ddf0c267-e279-4205-8a07-d4515e6f35af)
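The failing line indexes `param_groups[0]['params'][0]`, so the `IndexError` means the first optimizer parameter group (or the group list itself) is empty. One thing worth ruling out, as an assumption rather than a confirmed diagnosis, is that no parameter has `requires_grad=True` when the optimizer is built, for example because a step that freezes the base model ran after the LoRA adapters were attached. The helpers below are hypothetical diagnostics, not part of the repository; they only rely on the `named_parameters()`, `numel()`, and `requires_grad` interface of a `torch.nn.Module`.

```python
def summarize_trainable(model):
    """Return (trainable, total) parameter counts for a torch.nn.Module-like object."""
    trainable = 0
    total = 0
    for _name, param in model.named_parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total


def check_optimizer_groups(param_groups):
    """Raise a readable error if the optimizer has no groups or an empty group."""
    if len(param_groups) == 0:
        raise ValueError("optimizer has no parameter groups")
    for i, group in enumerate(param_groups):
        if len(group["params"]) == 0:
            raise ValueError(f"optimizer parameter group {i} has no parameters")


# Intended use, just before the trainer or DeepSpeed engine is created:
#   trainable, total = summarize_trainable(model)
#   print(f"trainable: {trainable} / {total}")
#   check_optimizer_groups(optimizer.param_groups)
```

If the trainable count is zero, the optimizer receives nothing to update, and the order of the model preparation calls is the first place to look.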
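It runs for me on 4 GPUs.
> It runs for me on 4 GPUs.

Did you load with 8-bit quantization, or use some other memory-saving technique?
> It runs for me on 4 GPUs.

Four 3090s?
A single card with 45 GB is enough; it uses roughly 42 GB.
> A single card with 45 GB is enough; it uses roughly 42 GB.

Thanks.
I have four 3090s running LoRA fine-tuning (pt). GPU memory is full on all of them, peaking at 23 GB.
> 23 GB

What sequence length are you using?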
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.
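The following items must be checked before submitting
Issue type
Model training and fine-tuning
Base model
Alpaca-Plus-13B
Operating system
Linux
Detailed description of the problem
Dependencies (must be provided for code-related issues)
No response
Run logs or screenshots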