yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License
2.93k stars, 360 forks

I want to train bloom with deepspeed, but I get the following error #118

Open fredericklee602 opened 1 year ago

fredericklee602 commented 1 year ago

I ran pip install deepspeed and then ran sh ds_all.sh directly, but got the following error. What is going on?

zero_nlp-main/chinese_bloom$ sh ds_all.sh
[2023-06-03 12:26:34,143] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-03 12:26:34,174] [INFO] [runner.py:541:main] cmd = /home/gufonet/anaconda3/envs/LLM/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=42 --enable_each_rank_log=None train.py --model_name_or_path bigscience/bloom-7b1 --data_path data_proj/opendata --bf16 False --output_dir output_dir --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --deepspeed ./configs/default_offload_opt_param.json --tf32 False
[2023-06-03 12:26:36,098] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-03 12:26:36,098] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-03 12:26:36,098] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-03 12:26:36,098] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-03 12:26:36,098] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:42 (errno: 13 - Permission denied).
[W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/home/gufonet/SolventoSoft/zero_nlp-main/chinese_bloom/train.py", line 280, in <module>
    train()
  File "/home/gufonet/SolventoSoft/zero_nlp-main/chinese_bloom/train.py", line 237, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 113, in __init__
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/training_args.py", line 1333, in __post_init__
    and (self.device.type != "cuda")
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/training_args.py", line 1697, in device
    return self._setup_devices
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/training_args.py", line 1627, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/accelerate/state.py", line 117, in __init__
    torch.distributed.init_process_group(backend="nccl", **kwargs)
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/gufonet/anaconda3/envs/LLM/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:42 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied).
[2023-06-03 12:26:41,106] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2163606
[2023-06-03 12:26:41,107] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2163607
[2023-06-03 12:26:41,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2163608
[2023-06-03 12:26:41,249] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2163609
[2023-06-03 12:26:41,333] [ERROR] [launch.py:434:sigkill_handler] ['/home/gufonet/anaconda3/envs/LLM/bin/python', '-u', 'train.py', '--local_rank=3', '--model_name_or_path', 'bigscience/bloom-7b1', '--data_path', 'data_proj/opendata', '--bf16', 'False', '--output_dir', 'output_dir', '--num_train_epochs', '3', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '8', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--deepspeed', './configs/default_offload_opt_param.json', '--tf32', 'False'] exits with return code = 1
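
One detail in the log above can be checked independently of the training code. The launcher command contains --master_port=42, and c10d then fails to bind [::]:42 with errno 13 (Permission denied). On Linux, ports below 1024 are privileged by default, so a non-root process normally cannot listen on them, which would explain this RuntimeError. The sketch below is only a diagnostic aid: can_bind and find_free_port are hypothetical helper names, not part of zero_nlp or deepspeed, and since ds_all.sh is not shown in this thread, where the port is actually set is an assumption.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process is allowed to bind a TCP socket to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers both "Permission denied" (errno 13) and "Address already in use".
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0 and reading it back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("can bind port 42:", can_bind(42))
    print("a usable master port:", find_free_port())
```

If can_bind(42) reports False on the training machine, passing an unprivileged port to the launcher (for example, deepspeed --master_port 29500 train.py ...) is a reasonable thing to try before changing anything else.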

yuanzhoulvpi2017 commented 1 year ago
  1. Using deepspeed is indeed broken at the moment. The problem is mainly in the model-loading code, which is wrong. The relevant code is at line_242. In this call:

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        device_map='auto',
        torch_dtype=torch.bfloat16
    )

    delete the device_map='auto' line, then try again.
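
The suggested change can be sketched as a small helper that builds the from_pretrained keyword arguments and leaves out device_map when deepspeed is in use. This is a minimal sketch and not the repository's code: build_load_kwargs and its use_deepspeed flag are hypothetical names, and how train.py would detect a deepspeed run (for instance by checking whether training_args.deepspeed is set) is an assumption.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    cache_dir: Optional[str],
    torch_dtype: Any,
    use_deepspeed: bool,
) -> Dict[str, Any]:
    """Build keyword arguments for AutoModelForCausalLM.from_pretrained."""
    kwargs: Dict[str, Any] = {
        "cache_dir": cache_dir,
        "torch_dtype": torch_dtype,
    }
    # Only request automatic device placement when deepspeed is NOT managing
    # the model; under deepspeed the argument is omitted entirely.
    if not use_deepspeed:
        kwargs["device_map"] = "auto"
    return kwargs


# Intended use inside train.py (not executed here; needs transformers and torch):
#
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     model_args.model_name_or_path,
#     **build_load_kwargs(
#         training_args.cache_dir,
#         torch.bfloat16,
#         use_deepspeed=True,  # assumption: the script was launched via deepspeed
#     ),
# )
```

Keeping the switch in one place lets the same script run on a single machine with device_map='auto' and under the deepspeed launcher without it.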