minienglish1 opened 1 day ago
@minienglish1 Thanks for the kind words and support!
Your suggestion seems to be a great fix. We will test it on our side and update it on PyPI.
We will keep you updated here.
Thanks, but the code is not mine; I copied it directly from ChatGPT o1. I only verified that it worked with my training script. You should test it thoroughly.
I also tested the following modification suggested by ChatGPT o1. It also appears to work fine, at speeds similar to those in the post above. Perhaps it will benefit someone who finds this issue thread. Again, the code is copied directly from ChatGPT o1.
Use the GLOO Backend for CPU Tensors:
Initialize a separate process group with the GLOO backend, which supports CPU tensors, and use it for CPU-based collective operations.
Modify the Optimizer Initialization:
Add a GLOO process group in the __init__ method of your optimizer:

```python
import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    def __init__(self, named_parameters, **kwargs):
        # ... existing initialization ...
        # Initialize the default backend, plus a separate GLOO group when the default is NCCL
        if not dist.is_initialized():
            dist.init_process_group(backend='nccl' if torch.cuda.is_available() else 'gloo')
        self.default_backend = dist.get_backend()
        if self.default_backend == 'nccl':
            self.gloo_group = dist.new_group(backend='gloo')
```
Modify the step Method:
In your step method, use the GLOO group for CPU tensors:
```python
@torch.no_grad()
def step(self, closure=None):
    # ... existing step code ...
    if state["reduced"]:
        # Use the GLOO group when the tensor is on CPU and the default backend is NCCL
        if tmp_lr.device.type == 'cpu' and self.default_backend == 'nccl':
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM, group=self.gloo_group)
        else:
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
```
Explanation:
GLOO Backend: GLOO supports both CPU and GPU tensors, making it suitable for CPU operations.
Separate Process Group: By creating a new process group with GLOO, you avoid interfering with the existing NCCL-based group used for GPU operations.
Conditional All-Reduce: The code checks the device type of tmp_lr and uses the appropriate backend.
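For anyone who wants to see the two-group pattern in isolation before touching the optimizer, below is a small self-contained sketch of my own (not taken from adam-mini; the file and function names are made up). It creates a GLOO group next to the default NCCL group and all-reduces a CPU tensor through it:

```python
# Standalone sketch of the two-process-group pattern; launch with e.g.
#   torchrun --nproc_per_node=2 gloo_group_demo.py
import os

import torch
import torch.distributed as dist


def run_demo():
    # Default group: NCCL when GPUs are available, otherwise GLOO.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    default_backend = dist.get_backend()

    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Extra GLOO group so CPU tensors can still be all-reduced
    # while GPU tensors keep going through NCCL.
    gloo_group = dist.new_group(backend="gloo") if default_backend == "nccl" else None

    # A CPU tensor, standing in for offloaded optimizer state such as tmp_lr.
    cpu_tensor = torch.tensor([float(dist.get_rank() + 1)])
    if cpu_tensor.device.type == "cpu" and default_backend == "nccl":
        dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM, group=gloo_group)
    else:
        dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("sum over ranks:", cpu_tensor.item())  # 3.0 with two ranks

    dist.destroy_process_group()


if __name__ == "__main__":
    run_demo()
```

The device/backend check is the same one used in the modified step above, so whatever works here should carry over to the optimizer.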
@minienglish1 Hi, I think your change can work in general FSDP offload cases. We have merged your changes (with some minor adjustments) into Adam-mini version 1.0.4.
It is also updated on PyPI; you can run pip install adam-mini again to get the latest version.
Thanks for your great suggestions! We have expressed our gratitude to you in the acknowledgments. :D
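A quick way to confirm the upgrade actually landed (a generic standard-library check; note that pip may need the --upgrade flag to replace an already-installed 1.0.3):

```python
from importlib.metadata import version

# Should print 1.0.4 or newer after re-installing with `pip install --upgrade adam-mini`
print(version("adam-mini"))
```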
Training a Stable Diffusion XL UNet using the accelerate library with FSDP: fsdp_offload_params: true; fsdp_sharding_strategy: SHARD_GRAD_OP
Environment: accelerate-0.34.2, torch-2.4.1, CUDA Version 12.4, adam_mini-1.0.3 (pip install)
Full Error:
rank1: Traceback (most recent call last):
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1618, in <module>
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1296, in main
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 159, in step
rank1:     self.scaler.step(self.optimizer, closure)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 454, in step
rank1:     retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
rank1:     retval = optimizer.step(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 214, in patched_step
rank1:     return method(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
rank1:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank1:     out = func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank1:     return func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/adam_mini/adam_mini.py", line 317, in step
rank1:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
rank1:     return func(*args, **kwargs)
rank1:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
rank1:     work = group.allreduce([tensor], opts)
rank1: RuntimeError: No backend type associated with device type cpu
I passed the code & error to ChatGPT o1 with the requirement "forces the communication in GPUs when GPU is available", based on your response to issue "Adam mini can't offload to CPU #3". Its response was to modify the code as follows:
@torch.no_grad()
def step(self, closure=None):
    ... your existing code ...
Using this modification allowed the script to train.
Compared to AdamW, the loss was similar for the first 25 steps. ChatGPT o1 also suggested batching the tensor transfers or using GLOO for the tensors on CPU. But I trust that you know your code better than ChatGPT.
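Since I didn't paste the full modification above, here is a rough sketch of the general idea as a standalone helper; this is not the exact o1-generated code, and the helper name all_reduce_maybe_on_gpu is made up for illustration:

```python
import torch
import torch.distributed as dist


def all_reduce_maybe_on_gpu(tensor: torch.Tensor) -> torch.Tensor:
    """Sum-all-reduce `tensor`; if it lives on CPU but a GPU is available,
    stage it on the local GPU so the NCCL backend can handle the collective."""
    if tensor.device.type == "cpu" and torch.cuda.is_available():
        gpu_tensor = tensor.to(torch.cuda.current_device())
        dist.all_reduce(gpu_tensor, op=dist.ReduceOp.SUM)
        tensor.copy_(gpu_tensor.cpu())  # write the reduced value back to the CPU tensor
    else:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor
```

Inside step, tmp_lr would go through a helper like this instead of being all-reduced directly on CPU.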
Further, based on "Adam mini can't save when using with FSDP in Huggingface Trainer #5", setting fsdp_use_orig_params: false allowed the training state to be saved.
Really excited that Adam-mini can now be used with FSDP with cpu_offload. Thanks for all your hard work on this!