xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

NCCL version #12

Closed cdhx closed 2 years ago

cdhx commented 2 years ago

Hi

I installed the environment from the YAML file and installed torch 1.8 following the settings in the README.

My CUDA version is 11.4; it looks like there is a version conflict between NCCL, PyTorch, and CUDA.

Is my CUDA version too high?
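
For reference, a quick check roughly like the following (just a sketch; the exact output format differs across PyTorch versions) should show whether the process can actually see the GPUs and which CUDA/NCCL build PyTorch was compiled with:

import torch
print(torch.__version__, torch.version.cuda)                  # PyTorch version and the CUDA version it was built against
print(torch.cuda.is_available(), torch.cuda.device_count())   # whether any GPU is visible to this process
print(torch.cuda.nccl.version())                              # NCCL version bundled with this PyTorch build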

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg --run_name T5_base_prefix_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 acquired on .lock
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 released on .lock
INFO:filelock:Lock 140144150953768 acquired on .lock
INFO:filelock:Lock 140144150953768 released on .lock
INFO:filelock:Lock 139898741587640 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139898741587640 released on .lock
INFO:filelock:Lock 139711655354096 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139711655354096 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 30316
Killing subprocess 30320
Killing subprocess 30321
Killing subprocess 30322
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_prefix_compwebq.cfg', '--run_name', 'T5_base_prefix_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_prefix_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1

I also tried torch 1.11+cu113 and got another error:


(skg) xh@4210GPU:~/PycharmProject/UnifiedSKG$ python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17913) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17914)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 17915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 17916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17913)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Looking forward to your reply. Thank you.

cdhx commented 2 years ago

One more piece of information: I only use one GPU.
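
Since the command above starts 4 processes via --nproc_per_node 4, a single-GPU variant might look roughly like this (just a sketch, keeping the remaining arguments unchanged; CUDA_VISIBLE_DEVICES=0 is a placeholder for whichever GPU is actually used):

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg ... (remaining arguments as in the command above)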

Timothyxxx commented 2 years ago

Indeed, we ran into a similar situation during our experiments on some machines (we used many GPUs on different kinds of HPC clusters). I remember we fixed that issue by installing the PyTorch build that matches your CUDA version.
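
For example, a command along these lines installs the CUDA 11.1 build of torch 1.8.1 (please double-check it against the official PyTorch installation instructions for your CUDA version rather than copying it verbatim):

pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html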

cdhx commented 2 years ago

I will try it, thanks

cdhx commented 2 years ago

Sorry for bothering you, but I still cannot get it to run.

If I do not choose a GPU, it seems to work fine (not sure, though, because it ends with an out-of-memory error).

But if I choose one GPU, I get the error below. My torch version is 1.8.1+cu111, the same as the environment in the README.

INFO:filelock:Lock 140672744878032 acquired on .lock
INFO:filelock:Lock 140672744878032 released on .lock
INFO:filelock:Lock 139725435957376 acquired on .lock
INFO:filelock:Lock 139725435957376 released on .lock
INFO:filelock:Lock 140361970860216 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 140361970860216 released on .lock
INFO:filelock:Lock 139971805774008 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139971805774008 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    cached = self.fget(obj)    
if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper

  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 4345
Killing subprocess 4346
Killing subprocess 4347
Killing subprocess 4348
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_finetune_compwebq.cfg', '--run_name', 'T5_base_finetune_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_finetune_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1

Another question: does it support torch 1.11? With torch 1.11+cu113 I get AttributeError: module 'torch' has no attribute 'set_deterministic', but when I check the source code it seems to have that attribute.

Thanks

ChenWu98 commented 2 years ago

I think it is a PyTorch version issue. In my personal experience, removing the following lines works for other PyTorch versions: https://github.com/HKUNLP/UnifiedSKG/blob/fab45fea3a349c9dbda4ed34482df227920272db/train.py#L27-L29
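
If you would rather keep determinism than delete the lines, a version-tolerant guard roughly like the following could stand in for the torch.set_deterministic(True) call (a sketch only, not code from the repo):

import torch

# torch.set_deterministic was deprecated and later removed; newer releases expose torch.use_deterministic_algorithms instead.
if hasattr(torch, "use_deterministic_algorithms"):
    torch.use_deterministic_algorithms(True)
elif hasattr(torch, "set_deterministic"):
    torch.set_deterministic(True)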

ChenWu98 commented 2 years ago

It may sacrifice reproducibility, if that is not your main concern.

cdhx commented 2 years ago

Thanks for your reply, but it still does not work. The error occurs only when I choose a single GPU, and the error message seems to suggest it is caused by distributed training?

ChenWu98 commented 2 years ago

Ohh the above answer is for your second question. After removing the three lines, does torch 1.11+cu113 work? For your first question, we are still exploring.

cdhx commented 2 years ago

> Ohh the above answer is for your second question. After removing the three lines, does torch 1.11+cu113 work? For your first question, we are still exploring.

It does not work. This is the error log with a single GPU:

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:filelock:Lock 140395078486112 acquired on .lock
INFO:filelock:Lock 140395078486112 released on .lock
INFO:filelock:Lock 140049714288456 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    timeout=timeout,
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 738, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 140049714288456 released on .lock
INFO:filelock:Lock 140359993235328 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    timeout=timeout,
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 738, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7061 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7062 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7067 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 7066) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-16_11:27:44
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 7066)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Process finished with exit code 1
cdhx commented 2 years ago

Here is the log when I do not choose a GPU; it seems to work fine (apart from the out-of-memory error at the end):

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:filelock:Lock 140516411049744 acquired on .lock
INFO:filelock:Lock 140516411049744 released on .lock
INFO:filelock:Lock 140004243490688 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:filelock:Lock 140004243490688 released on .lock
INFO:filelock:Lock 139957069264600 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:filelock:Lock 139957069264600 released on .lock
INFO:filelock:Lock 140293807781760 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:filelock:Lock 140293807781760 released on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
task_args.bert.location: t5-base
task_args.bert.location: t5-base
task_args.bert.location: t5-base
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
  0%|                                                     | 0/3 [00:00<?, ?it/s]WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 782.23it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 814.48it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 596.74it/s]
wandb: Currently logged in as: myproject (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.11 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run T5_base_finetune_compwebq
wandb: ⭐️ View project at https://wandb.ai/myproject/skg
wandb: 🚀 View run at https://wandb.ai/myproject/skg/runs/2n7lsqyq
wandb: Run data is saved locally in /home2/xh/PycharmProject/UnifiedSKG/wandb/run-20220316_114350-2n7lsqyq
wandb: Run `wandb offline` to turn off syncing.

task_args.bert.location: t5-base
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 923.04it/s]
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Trainer build successfully.
Trainer build successfully.
Trainer build successfully.
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 146, in main
    callbacks=[early_stopping_callback],
  File "/home2/xh/PycharmProject/UnifiedSKG/utils/trainer.py", line 50, in __init__
    super().__init__(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/trainer.py", line 367, in __init__
    self._move_model_to_device(model, args.device)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/trainer.py", line 509, in _move_model_to_device
    model = model.to(device)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ChenWu98 commented 2 years ago

Here is a minimal example of distributed training: https://towardsdatascience.com/distributed-neural-network-training-in-pytorch-5e766e2a9e62
Could you verify whether it works on your machine?
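
In the same spirit, here is a bare-bones NCCL sanity check (a sketch only; the file name ddp_check.py is hypothetical). It could be launched with, e.g., python -m torch.distributed.launch --use_env --nproc_per_node 1 ddp_check.py or torchrun --nproc_per_node 1 ddp_check.py. If even this fails with "no GPUs found", the problem lies in the environment rather than in UnifiedSKG:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun / --use_env put the per-process rank into the LOCAL_RANK environment variable
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    print("cuda available:", torch.cuda.is_available(), "device count:", torch.cuda.device_count())
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")   # the same call that fails inside transformers' TrainingArguments
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)                        # a trivial collective to confirm NCCL communication works
    print("rank", dist.get_rank(), "all_reduce result:", x.item())

if __name__ == "__main__":
    main()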

neel04 commented 1 year ago

Resetting the port and the rdzv_id works for me. I suspect multiple runs with the same parameters collide, but I'm not really sure.
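
For example, with torchrun a launch roughly like the following picks a fresh rendezvous id and endpoint, so two runs started with otherwise identical parameters do not collide (the id and port values here are placeholders):

torchrun --nproc_per_node 4 --rdzv_backend c10d --rdzv_endpoint localhost:29501 --rdzv_id my_unique_run_42 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg ... (remaining arguments as in the commands above)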