xyfJASON / ctrlora

Codebase for "CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation"
Apache License 2.0

RuntimeError: Distributed package doesn't have NCCL built in #8

Closed · toyxyz closed this issue 1 week ago

toyxyz commented 2 weeks ago

When I ran the example training script, I got the following error:

HPU available: False, using: 0 HPUs
C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:7250 (system error: 10049 - The requested address is not valid in that context.).
Traceback (most recent call last):
  File "C:\Users\toyxy\ctrlora\scripts\train_ctrlora_finetune.py", line 130, in <module>
    trainer.fit(model, dataloader)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\trainer\call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 937, in _run
    self.strategy.setup_environment()
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 154, in setup_environment
    self.setup_distributed()
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 203, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\lightning_fabric\utilities\distributed.py", line 297, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "C:\Users\toyxy\.conda\envs\ctrlora\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
scarbain commented 2 weeks ago

Had the same issue. I gave the script and the error to o1-mini, and it fixed it with the following replacement for ./scripts/train_ctrlora_finetune.py: https://pastebin.com/qQGWq7qx
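The pastebin script itself is not reproduced here, but the root cause of this error is that Windows builds of PyTorch ship without NCCL, so distributed initialization has to fall back to the gloo backend. A minimal sketch of that backend selection (pick_ddp_backend is an illustrative helper, not part of the repo):

```python
import sys

def pick_ddp_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Choose a torch.distributed process-group backend.

    NCCL is not compiled into the Windows (or macOS) wheels of PyTorch,
    which is exactly what raises "Distributed package doesn't have NCCL
    built in". On those platforms DDP must fall back to gloo.
    """
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"

# With PyTorch Lightning, the chosen backend can be passed explicitly, e.g.:
#   from pytorch_lightning.strategies import DDPStrategy
#   trainer = pl.Trainer(strategy=DDPStrategy(process_group_backend=pick_ddp_backend()))
```

In practice, `torch.distributed.is_nccl_available()` can supply the `nccl_available` flag at runtime.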

toyxyz commented 2 weeks ago

I then got another error:

TypeError: LatentDiffusion.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'

The offending method is here: https://github.com/xyfJASON/ctrlora/blob/177e51453f25a96fd7fa91ac361e276134f6df37/ldm/models/diffusion/ddpm.py#L591

I changed its signature to def on_train_batch_start(self, batch, batch_idx):

as suggested in https://github.com/lllyasviel/ControlNet/issues/84#issuecomment-1434004246
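For anyone hitting the same TypeError: PyTorch Lightning 2.x stopped passing dataloader_idx to this hook, so the override in ddpm.py must drop that parameter to match what Lightning now calls. A stripped-down sketch of the fixed signature (the real method body in ddpm.py is omitted):

```python
import inspect

class LatentDiffusion:
    # Older pytorch_lightning invoked this hook as
    #     on_train_batch_start(batch, batch_idx, dataloader_idx)
    # but pytorch_lightning 2.x dropped the dataloader_idx argument, so the
    # override must declare only (self, batch, batch_idx) to match.
    def on_train_batch_start(self, batch, batch_idx):
        pass  # placeholder for the real logic in ddpm.py

# The new signature accepts exactly what Lightning 2.x passes:
params = list(inspect.signature(LatentDiffusion.on_train_batch_start).parameters)
```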

toyxyz commented 2 weeks ago

> Had the same issue. I gave the script and the error to o1-mini, and it fixed it with the following replacement for ./scripts/train_ctrlora_finetune.py: https://pastebin.com/qQGWq7qx

Thank you!

I also needed to change the hard-coded output directory

default_root_dir=os.path.join("U:/Flaash/Watermark_removal/CTRLORA", args.name),

to

default_root_dir=os.path.join('runs', args.name),
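The change above just removes a machine-specific absolute path; a quick sketch of the portable form (default_root_dir here is an illustrative helper, mirroring the values in the comment):

```python
import os

# A hard-coded absolute Windows path (like U:/...) breaks on any other machine.
# A relative directory keeps the Trainer's outputs inside the working tree;
# "runs" and the name argument mirror the values used in the fixed script.
def default_root_dir(name: str) -> str:
    return os.path.join("runs", name)
```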