yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License
2.81k stars · 351 forks

Multi-node, multi-GPU fine-tuning directly with DeepSpeed when LoRA-tuning chatglm-6b #174

Open Tom722 opened 3 months ago

Tom722 commented 3 months ago

deepspeed.json:

    {
        "fp16": {
            "enabled": true,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 3e-5,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 3e-5,
                "warmup_num_steps": 500
            }
        },
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": true
            },
            "allgather_partitions": true,
            "allgather_bucket_size": 2e8,
            "overlap_comm": true,
            "reduce_scatter": true,
            "reduce_bucket_size": 2e8,
            "contiguous_gradients": true
        },
        "steps_per_print": 2000,
        "wall_clock_breakdown": false
    }
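
For reference, in the HuggingFace-Trainer-based fine-tuning scripts this repo builds on, a JSON file like this is not read by hand; it is handed to the Trainer through the deepspeed training argument. A minimal sketch, assuming main.py constructs a standard Seq2SeqTrainingArguments (the argument values are copied from the launch script below):

    # Minimal sketch, assuming main.py builds HuggingFace Seq2SeqTrainingArguments;
    # "deepspeed.json" is the ZeRO-2 config shown above.
    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="output/adgen-chatglm2-6b-lora_version",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        learning_rate=2e-5,
        max_steps=100,
        fp16=True,                   # matches "fp16": {"enabled": true} above
        deepspeed="deepspeed.json",  # Trainer initializes the DeepSpeed engine from this file
    )

One caveat: the JSON pins the optimizer lr to 3e-5 while the launch script passes --learning_rate 2e-5; the HuggingFace DeepSpeed integration flags such conflicts, so the usual practice is to set those JSON fields to "auto" and let the Trainer fill them in.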

Launch script (.sh):

    MASTER_PORT=65534
    deepspeed --hostfile hostfile --num_gpus=2 --master_port $MASTER_PORT main.py \
        --deepspeed deepspeed.json \
        --do_train \
        --train_file dataset/AdvertiseGen/train.json \
        --validation_file dataset/AdvertiseGen/dev.json \
        --preprocessing_num_workers 2 \
        --prompt_column content \
        --response_column summary \
        --overwrite_cache \
        --model_name_or_path /data/ChatGLM2-6B/ \
        --output_dir output/adgen-chatglm2-6b-lora_version \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 128 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --predict_with_generate \
        --max_steps 100 \
        --logging_steps 10 \
        --save_steps 100 \
        --learning_rate 2e-5 \
        --lora_r 8
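
For the multi-node half of the question: the deepspeed launcher takes the node list from the file given to --hostfile, with one line per host naming the machine (reachable over passwordless SSH) and the number of GPU slots on it; the code, dataset, and Python environment must sit at the same paths on every node. A minimal hostfile sketch (the hostnames worker-1 and worker-2 are placeholders):

    worker-1 slots=2
    worker-2 slots=2

When a hostfile is supplied, --num_gpus restricts how many GPUs are used on each node, so the command above would launch 2 processes per listed host.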

Can the DeepSpeed config above be used for multi-node, multi-GPU fine-tuning, and how can I verify how many parameters each GPU has loaded?
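
On the second half of the question, a minimal sketch of a per-rank check, assuming it is dropped into main.py after the model is created (the variable name model is an assumption):

    # Minimal sketch: report parameter counts and GPU memory per rank.
    # Assumes torch.distributed was initialized by the deepspeed launcher
    # and `model` is the LoRA-wrapped model built in main.py.
    import torch
    import torch.distributed as dist

    rank = dist.get_rank() if dist.is_initialized() else 0
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    mem_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"rank {rank}: total={total:,} trainable={trainable:,} "
          f"allocated={mem_gb:.2f} GiB")

Note that under ZeRO stage 2 (as in the config above) only optimizer states and gradients are partitioned across ranks, so every GPU holds a full copy of the weights and the counts above will be identical on each rank; only ZeRO stage 3 shards the parameters themselves (where p.numel() can return 0 for a partitioned parameter and p.ds_numel holds the true size). The per-rank savings from stage 2 plus CPU optimizer offload show up in nvidia-smi memory usage rather than in the parameter count.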