zhiyuanhubj / LongRecipe

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
https://arxiv.org/abs/2409.00509

Multi-node multi-GPU setup #5

Open 233function opened 3 weeks ago

233function commented 3 weeks ago

Hello! I keep running into OOM when extrapolating the 72B model on 64 GPUs. Is something misconfigured in multi_node.yaml?

multi_node.yaml

```yaml
debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
num_processes: 128
num_machines: 128
main_training_function: main
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
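One thing worth double-checking (a guess based on how Accelerate normally interprets these fields, not something confirmed by the maintainers): for a 64-GPU run, `num_processes` is usually the total number of processes across all nodes (64, one per GPU) and `num_machines` the number of nodes (e.g. 8 for an 8×8 setup), whereas the file above sets both to 128. A minimal sketch of the adjusted fields, assuming 8 machines with 8 GPUs each (IP and port are placeholders):

```yaml
# Sketch only: assumes 8 nodes x 8 GPUs = 64 total processes; adjust to your cluster.
num_machines: 8                   # number of nodes, not total GPUs
num_processes: 64                 # total processes across all nodes (one per GPU)
machine_rank: 0                   # set per node (0..7) at launch time
main_process_ip: <head-node-ip>   # placeholder: address of the rank-0 node
main_process_port: 29500          # placeholder rendezvous port
```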

zero3_offload.json

```json
{
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-5,
      "warmup_num_steps": 0,
      "warmup_type": "linear"
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
```

zhiyuanhubj commented 2 days ago

Hello, sorry for the late response. We haven't fully tested the yaml for multi-node training. However, we are currently working on training the LLM with a 512k context window, which requires multi-node training, and we will release the script within two weeks.