zyushun / Adam-mini

Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793

RuntimeError: No backend type associated with device type cpu #28

Open minienglish1 opened 1 day ago

minienglish1 commented 1 day ago

Training the Stable Diffusion XL UNet using the accelerate library with FSDP: `fsdp_offload_params: true`; `fsdp_sharding_strategy: SHARD_GRAD_OP`.

Environment: accelerate 0.34.2, torch 2.4.1, CUDA 12.4, adam_mini 1.0.3 (installed via pip)

Full Error:

```
rank1: Traceback (most recent call last):
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1618, in <module>
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1296, in main
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 159, in step
rank1:     self.scaler.step(self.optimizer, closure)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 454, in step
rank1:     retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
rank1:     retval = optimizer.step(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 214, in patched_step
rank1:     return method(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
rank1:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank1:     out = func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank1:     return func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/adam_mini/adam_mini.py", line 317, in step
rank1:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
rank1:     return func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
rank1:     work = group.allreduce([tensor], opts)
rank1: RuntimeError: No backend type associated with device type cpu
```
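
For context, here is a minimal sketch of what seems to be happening (a hypothetical repro script, not part of my trainer): when only the NCCL backend is initialized, an all-reduce on a CPU tensor, which is what FSDP CPU offload produces, fails with this same error.

```python
# repro.py (hypothetical) -- run with: torchrun --nproc_per_node=2 repro.py
# Assumes CUDA is available and only the NCCL backend is initialized.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    gpu_t = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_t, op=dist.ReduceOp.SUM)  # works: NCCL handles CUDA tensors

    cpu_t = torch.ones(1)  # CPU tensor, like the offloaded state Adam-mini reduces
    dist.all_reduce(cpu_t, op=dist.ReduceOp.SUM)  # RuntimeError: No backend type associated with device type cpu

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```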

I passed the code & error to ChatGPT o1 with the requirement to "force the communication on GPUs when a GPU is available", based on your response to the issue "Adam mini can't offload to CPU #3". Its response was to modify the code as follows:

```python
@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if state["reduced"]:
        # Force communication over GPUs when GPUs are available
        if tmp_lr.device.type == 'cpu':
            # Move the tensor to the current GPU device
            tmp_lr_gpu = tmp_lr.to(torch.cuda.current_device())
            # Perform the all-reduce operation on the GPU tensor
            dist.all_reduce(tmp_lr_gpu, op=dist.ReduceOp.SUM)
            # Move the result back to the CPU tensor
            tmp_lr.copy_(tmp_lr_gpu.cpu())
        else:
            # Tensor is already on GPU, use NCCL backend
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
```

Using this modification allowed the script to train.

Compared to AdamW, the loss was similar for the first 25 steps. ChatGPT o1 also suggested batching the tensor transfers (sketched below), or using GLOO for the tensors that stay on CPU, but I trust you know your code better than ChatGPT.
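
For reference, here is a rough sketch of the "batch the tensor transfer" idea, purely hypothetical and not taken from Adam-mini: instead of moving each `tmp_lr` to the GPU one at a time, the CPU scalars could be gathered, moved and all-reduced as one flattened tensor, then scattered back. The `cpu_lrs` list and the helper name are my own, just to illustrate the pattern.

```python
# Hypothetical sketch of batching the CPU->GPU transfer before the all-reduce.
# `cpu_lrs` stands in for the per-block tmp_lr scalars that live on CPU; where
# this would hook into Adam-mini's step() is not shown here.
import torch
import torch.distributed as dist

def batched_all_reduce(cpu_lrs: list[torch.Tensor]) -> None:
    """All-reduce a list of CPU scalars with a single GPU round trip."""
    if not cpu_lrs:
        return
    device = torch.device("cuda", torch.cuda.current_device())
    # One transfer: flatten all CPU scalars into a single GPU tensor
    flat = torch.stack([t.detach().float() for t in cpu_lrs]).to(device)
    # One collective instead of one per tensor
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    # Scatter the reduced values back into the original CPU tensors
    flat_cpu = flat.cpu()
    for t, reduced in zip(cpu_lrs, flat_cpu):
        t.copy_(reduced)
```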

Further, based on "Adam mini can't save when using with FSDP in Huggingface Trainer #5", setting `fsdp_use_orig_params: false` allowed the training state to be saved.

Really excited that Adam-mini can be used with FSDP with CPU offload. Thanks for all your hard work on this!

zyushun commented 1 day ago

@minienglish1 Thanks for the kind words and support!

Your suggestion seems to be a great fix. We will test it on our side and update the package on PyPI.

We will keep you updated here.

minienglish1 commented 1 day ago

Thanks, but the code is not mine; I copied it directly from ChatGPT o1. I only verified that it worked with my training script, so you should test it thoroughly.

I also tested the following modification suggested by ChatGPT o1. It also appears to work fine, at a similar speed to the version in my post above. Perhaps it will benefit someone who finds this issue thread. Again, the code is copied directly from ChatGPT o1.

Use the GLOO Backend for CPU Tensors:

Initialize a separate process group with the GLOO backend, which supports CPU tensors, and use it for CPU-based collective operations.

Modify the Optimizer Initialization:

Add a GLOO process group in the `__init__` method of your optimizer:

```python
import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    def __init__(self, named_parameters, **kwargs):
        # ... your existing code ...

        # Initialize default backend and GLOO group if using NCCL
        if not dist.is_initialized():
            dist.init_process_group(backend='nccl' if torch.cuda.is_available() else 'gloo')
        self.default_backend = dist.get_backend()
        if self.default_backend == 'nccl':
            self.gloo_group = dist.new_group(backend='gloo')
```

Modify the step Method:

In your step method, use the GLOO group for CPU tensors:

```python
@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if state["reduced"]:
        # Use GLOO backend if tensor is on CPU
        if tmp_lr.device.type == 'cpu' and self.default_backend == 'nccl':
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM, group=self.gloo_group)
        else:
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
```

Explanation:

- GLOO Backend: GLOO supports both CPU and GPU tensors, making it suitable for CPU operations.
- Separate Process Group: By creating a new process group with GLOO, you avoid interfering with the existing NCCL-based group used for GPU operations.
- Conditional All-Reduce: The code checks the device type of `tmp_lr` and uses the appropriate backend.
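
For anyone who wants to sanity-check the GLOO-group approach outside the optimizer, here is a small standalone sketch of my own (the script name `gloo_check.py` is hypothetical) that all-reduces a CPU tensor through a separate GLOO group while NCCL remains the default backend:

```python
# gloo_check.py (hypothetical) -- run with: torchrun --nproc_per_node=2 gloo_check.py
# Demonstrates all-reducing a CPU tensor via a separate GLOO group while NCCL
# stays the default backend for GPU tensors.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    gloo_group = dist.new_group(backend="gloo")

    gpu_t = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_t, op=dist.ReduceOp.SUM)                    # default (NCCL) group

    cpu_t = torch.ones(1)
    dist.all_reduce(cpu_t, op=dist.ReduceOp.SUM, group=gloo_group)  # GLOO handles CPU tensors

    if dist.get_rank() == 0:
        print(f"gpu sum: {gpu_t.item()}, cpu sum: {cpu_t.item()}")  # both equal the world size

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
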
zyushun commented 1 day ago

@minienglish1 Hi, I think your change works for general FSDP offload cases. We have merged it (with some minor adjustments) into Adam-mini version 1.0.4.

It is also updated on PyPI; you can reinstall adam-mini (e.g. `pip install -U adam-mini`) to get the latest version.

Thanks for your great suggestions! We have added you to the acknowledgments. :D