zyushun / Adam-mini

Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793

Fix code that causes UnboundLocalError #9

Closed Mrw33554432 closed 1 month ago

Mrw33554432 commented 1 month ago

The current code raises an UnboundLocalError:

...Adam_mini.py", line 289, in step
    state["vmean"] = torch.zeros_like(torch.sum(grad * grad)).to(
                                                ^^^^
UnboundLocalError: cannot access local variable 'grad' where it is not associated with a value

so I modified it to

torch.zeros_like(torch.sum(p.data * p.data))

for both the vmean and temp_lr initializations. Then it works fine in my case.

Ideally they should behave the same, but I am not quite sure about the DTensor-related issue you mentioned in the comments. Please take a look, thanks.
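For reference, a rough sketch of the change (simplified; the actual surrounding code in Adam_mini.py keeps more state and differs in detail):

```python
import torch

# Sketch of the state initialization inside Adam_mini.step() (simplified).
p = torch.nn.Parameter(torch.randn(4, 4))
state = {}

# before the fix (raised UnboundLocalError, because `grad` had not been
# assigned yet on this path):
# state["vmean"] = torch.zeros_like(torch.sum(grad * grad)).to(...)

# after the fix: use the parameter itself, which is always available
state["vmean"] = torch.zeros_like(torch.sum(p.data * p.data))
state["temp_lr"] = torch.zeros_like(torch.sum(p.data * p.data))

print(state["vmean"])  # scalar zero tensor with the same dtype/device as p
```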

zyushun commented 1 month ago

Hi! Thanks for mentioning this! I just tested that your proposed solution is compatible with DTensor. Thanks for the great fix and I have merged the pull request!

awgu commented 1 month ago

@zyushun Just sharing some PyTorch details: DTensor is a "wrapper tensor subclass" ("subclass" as in DTensor inherits from torch.Tensor, and "wrapper" as in it uses torch.Tensor._make_wrapper_subclass and wraps a ._local_tensor).

A common pattern in legacy/existing optimizer code is to call things like p.data. That usage of .data effectively serves as Tensor.detach(), e.g. because the parameter is a leaf tensor that is being modified in-place. If the optimizer is not differentiable, we could equivalently wrap step() with torch.no_grad() and drop the .data calls, or use p.detach() instead.
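To illustrate the equivalent patterns, here is a hypothetical minimal optimizer (not the Adam-mini code):

```python
import torch

class SGDLikeOptimizer(torch.optim.Optimizer):
    """Hypothetical minimal optimizer illustrating the equivalent update patterns."""

    def __init__(self, params, lr=1e-2):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()  # pattern 1: wrap step() in no_grad, then plain `p` is fine
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # pattern 1: in-place update on `p` directly (allowed under no_grad)
                p.add_(p.grad, alpha=-group["lr"])
                # pattern 2 (legacy): p.data.add_(p.grad, alpha=-group["lr"])
                # pattern 3: p.detach().add_(p.grad, alpha=-group["lr"])


# usage sketch
w = torch.nn.Parameter(torch.randn(3))
opt = SGDLikeOptimizer([w], lr=0.1)
w.sum().backward()
opt.step()
```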

Another kind of .data usage is assignment like p.data = ..., which changes the tensor data without changing the Python object.
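For example, with a plain torch.Tensor / nn.Parameter:

```python
import torch

p = torch.nn.Parameter(torch.zeros(3))
obj_id = id(p)

# Assigning to .data swaps the underlying tensor data;
# `p` is still the same Python object (and still an nn.Parameter).
p.data = torch.ones(3)

assert id(p) == obj_id                 # same Python object
assert torch.equal(p, torch.ones(3))   # but it now holds the new values
```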

For DTensor, I think calling dtensor.data to mean dtensor.detach() is fine and should work, as you saw when you tested; however, dtensor.data = ... assignment does not work.