Closed. Phinease closed this issue 1 year ago.
Issue type
Model training and fine-tuning

Base model
Llama-13B

Operating system
CentOS

Describe the problem in detail
Currently training on 4 × A100 GPUs; the DeepSpeed (DS) config is as follows:
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 100, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1e-10 }, "zero_optimization": { "stage": 3, "offload_param": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 1e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1e8, "contiguous_gradients": true, }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
Training completed successfully:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [21:13<00:00, 12.73s/it]
[INFO|tokenization_utils_base.py:2194] 2023-07-26 12:05:43,496 >> tokenizer config file saved in /home/huguangxing/code/chatops/trained_model/chinese_alpaca_13b_16_64_100/pt_lora_model/tokenizer_config.json
[INFO|tokenization_utils_base.py:2201] 2023-07-26 12:05:43,496 >> Special tokens file saved in /home/huguangxing/code/chatops/trained_model/chinese_alpaca_13b_16_64_100/pt_lora_model/special_tokens_map.json
***** train metrics *****
  epoch                    =       0.09
  train_loss               =     1.7126
  train_runtime            = 0:21:13.42
  train_samples            =       4272
  train_samples_per_second =      0.314
  train_steps_per_second   =      0.079
07/26/2023 12:05:43 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3200] 2023-07-26 12:05:43,815 >> ***** Running Evaluation *****
[INFO|trainer.py:3202] 2023-07-26 12:05:43,815 >>   Num examples = 1
[INFO|trainer.py:3205] 2023-07-26 12:05:43,815 >>   Batch size = 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 717.47it/s]
***** eval metrics *****
  epoch                   =       0.09
  eval_accuracy           =     0.6403
  eval_loss               =     1.6641
  eval_runtime            = 0:00:03.74
  eval_samples            =          1
  eval_samples_per_second =      0.267
  eval_steps_per_second   =      0.267
  perplexity              =     5.2807
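(As a sanity check on the metrics: the reported perplexity is just the exponential of the eval loss, exp(1.6641) ≈ 5.2807, which matches the log.)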
After saving, the LoRA weights of some layers failed to save correctly:
base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight : torch.Size([5120, 16])
base_model.model.model.layers.0.self_attn.k_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.self_attn.k_proj.lora_B.weight : torch.Size([5120, 16])
base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight : torch.Size([5120, 16])
base_model.model.model.layers.0.self_attn.o_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.self_attn.o_proj.lora_B.weight : torch.Size([5120, 16])
base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight : torch.Size([0])
base_model.model.model.layers.0.mlp.down_proj.lora_A.weight : torch.Size([0])
base_model.model.model.layers.0.mlp.down_proj.lora_B.weight : torch.Size([5120, 16])
base_model.model.model.layers.0.mlp.up_proj.lora_A.weight : torch.Size([16, 5120])
base_model.model.model.layers.0.mlp.up_proj.lora_B.weight : torch.Size([0])
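(A note on what the empty shapes suggest: `torch.Size([0])` is the placeholder ZeRO stage 3 leaves in place of a parameter whose full tensor has not been gathered, since each rank only holds a shard. If the state dict is built outside a gather context, some entries come out empty. A minimal workaround sketch, assuming the PEFT-wrapped model is still running under ZeRO-3; the helper name `gather_lora_state_dict` is illustrative, not from this repo:)

```python
import deepspeed
import torch.distributed as dist

def gather_lora_state_dict(model):
    # Illustrative helper: all-gather the LoRA shards so each parameter
    # temporarily exposes its full tensor, then copy to CPU on rank 0.
    lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
    state_dict = {}
    # Read-only gather: modifier_rank=None means no rank's edits are
    # broadcast back when the context exits. All ranks must enter the
    # context, since the gather is a collective operation.
    with deepspeed.zero.GatheredParameters(lora_params, modifier_rank=None):
        if not dist.is_initialized() or dist.get_rank() == 0:
            for name, param in model.named_parameters():
                if "lora_" in name:
                    state_dict[name] = param.data.detach().cpu().clone()
    return state_dict
```

(Upgrading peft past 0.3.0.dev0 may also help, as later releases improved handling of partitioned parameters when saving; treat that as a suggestion to verify, not a confirmed fix.)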
Could you please take a look at how to handle this? Many thanks.
Dependencies (required for code-related issues)

pip list | grep -E 'transformers|peft|torch'
peft          0.3.0.dev0
torch         2.0.1+cu117
transformers  4.30.2
Run logs or screenshots

No response
Is there a specific error message at save time, or does the saving process complete without reporting any error?
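One way to check is to load the saved adapter file directly and print the tensor shapes; if the save silently wrote partitioned parameters, the empty `torch.Size([0])` shapes show up here even though no error was ever raised. A sketch, assuming the adapter was written as `adapter_model.bin` (peft's default file name) under the `pt_lora_model` directory from the log above:

```python
import torch

# Path taken from the training log above; the adapter_model.bin file name
# is an assumption based on peft's default save_pretrained output.
path = ("/home/huguangxing/code/chatops/trained_model/"
        "chinese_alpaca_13b_16_64_100/pt_lora_model/adapter_model.bin")
state = torch.load(path, map_location="cpu")
for name, tensor in state.items():
    print(name, ":", tuple(tensor.shape))
```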
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.