zyushun / Adam-mini

Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793

Adam mini can't offload to CPU #3

Closed hahuyhoang411 closed 1 month ago

hahuyhoang411 commented 1 month ago

I'm using accelerate launch to run FSDP with Adam-mini on the latest update, but it looks like it doesn't support CPU offload. Any help? Thank you!

[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/train.py", line 222, in <module>
[rank1]:     trainer_stats = trainer.train()
[rank1]:                     ^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py", line 440, in train
[rank1]:     output = super().train(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1885, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     self.optimizer.step()
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/optimizer.py", line 170, in step
[rank1]:     self.optimizer.step(closure)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank1]:     return wrapped(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Adam_mini.py", line 228, in step
[rank1]:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: No backend type associated with device type cpu
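
If it helps with debugging: my understanding is that with FSDP CPU offload the optimizer state sits on CPU, while the NCCL process group set up by accelerate launch can only reduce CUDA tensors. A stripped-down sketch of that failure mode (a hypothetical standalone script, not my actual train.py), launched with torchrun:

```python
# Hypothetical repro of the error above: an all_reduce on a CPU tensor while
# the process group only has the NCCL backend (the situation Adam-mini's
# `tmp_lr` ends up in when FSDP offloads state to CPU).
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun provides RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Under CPU offload, the scalar to be summed across ranks lives on the CPU.
    tmp_lr = torch.tensor(1.0)  # CPU tensor

    # Raises: RuntimeError: No backend type associated with device type cpu
    dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)


if __name__ == "__main__":
    main()
```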
chcoliang commented 1 month ago

Hi, thank you for mentioning this. It is likely caused by the NCCL backend not supporting CPU communication. We have updated Adam_mini.py to force the communication onto GPU whenever a GPU is available. We hope this solves the issue.
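
Roughly, the idea is to stage the scalar on the GPU before the all_reduce and move the result back afterwards. A simplified sketch of that workaround (assumed, not the exact code now in Adam_mini.py):

```python
# Simplified sketch (assumed, not the exact patch): NCCL cannot reduce CPU
# tensors, so do the cross-rank sum on the local GPU when one is available
# and return the result on the tensor's original device.
import torch
import torch.distributed as dist


def all_reduce_sum(tmp_lr: torch.Tensor) -> torch.Tensor:
    orig_device = tmp_lr.device
    if torch.cuda.is_available() and orig_device.type == "cpu":
        tmp_lr = tmp_lr.cuda()                     # stage on the current GPU
    dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)  # sum across ranks
    return tmp_lr.to(orig_device)                  # back to CPU-offloaded state
```

If there is no GPU at all, the tensor stays on CPU and the call falls through to whatever backend the process group was created with (e.g. gloo), so CPU-only runs are unaffected.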

hahuyhoang411 commented 1 month ago

Great, the fix works well. Thank you!