Open yacineMahdid opened 3 years ago
Will need to figure out how to structure the optimizer and take a step, the way my functions for optimization work right now might not be optimal.
After looking at an example of how PyTorch works, it seems that the way I structured it might work. I just need to have the gradient per weight and I'll be good to go.
In a nutshell, this is what we will be doing (note that in PyTorch the in-place update has to happen inside `torch.no_grad()` so autograd doesn't track it):

```python
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad
```
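As a framework-free sketch of that same update rule (the `Param` class here is a hypothetical stand-in for a tensor with a `.grad` attribute, not part of any real API):

```python
class Param:
    """Toy stand-in for a tensor carrying a value and a gradient."""
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad

learning_rate = 0.25
parameters = [Param(2.0, grad=4.0), Param(-1.0, grad=-2.0)]

# The whole update is just: param -= lr * param.grad, for each parameter.
for param in parameters:
    param.value -= learning_rate * param.grad

print([p.value for p in parameters])  # [1.0, -0.5]
```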
But we can wrap this in a class-like interface like so:

```python
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

[...]

# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because, by default, gradients are
# accumulated in buffers (i.e., not overwritten) whenever .backward()
# is called. Check out the docs of torch.autograd.backward for more details.
optimizer.zero_grad()

# Backward pass: compute the gradient of the loss with respect to model
# parameters.
loss.backward()

# Calling the step function on an Optimizer makes an update to its
# parameters.
optimizer.step()
```
This means that the optimizer will have access to the model parameters as well as the gradients. The one thing that is weird in PyTorch is that the loss has access to the model parameters.
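To make that ownership concrete, here is a minimal sketch of an optimizer that only holds references to the parameters; the `Param` class and `SGD` implementation below are hypothetical illustrations, not the actual code of this repo or of PyTorch:

```python
class Param:
    """Toy parameter holding a value and an accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class SGD:
    """Minimal optimizer: all it needs is references to the parameters."""
    def __init__(self, parameters, lr=0.1):
        self.parameters = list(parameters)
        self.lr = lr

    def zero_grad(self):
        # Reset accumulated gradients before the next backward pass.
        for p in self.parameters:
            p.grad = 0.0

    def step(self):
        # One gradient-descent update per parameter.
        for p in self.parameters:
            p.value -= self.lr * p.grad

params = [Param(1.0), Param(-2.0)]
opt = SGD(params, lr=0.5)
params[0].grad, params[1].grad = 2.0, -4.0  # pretend backward() filled these
opt.step()
print(params[0].value, params[1].value)  # 0.0 0.0
opt.zero_grad()
```

Because the optimizer mutates the shared `Param` objects in place, it never needs to know anything about the model architecture or the loss.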
I'll simplify this right now since I still don't have a dynamic graph solver implemented!
What I should have is something like this:
```python
optimizer = SGD(model.parameters(), optimizer_parameters...)

[...]

optimizer.zero_grad()  # this will remove all the gradients accumulated
optimizer.backward()   # since it already has access to the graph and to the gradients
optimizer.step()       # do one gradient descent step
```
Little correction: we shouldn't have the optimizer doing the backward pass, since this depends only on the model and not on the optimizer!
We should be doing this instead:
```python
optimizer = SGD(model.parameters(), optimizer_parameters...)

[...]

optimizer.zero_grad()  # this will remove all the gradients accumulated
model.backward()       # the behavior of backward will be architecture specific
optimizer.step()       # do one gradient descent step
```
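A minimal end-to-end sketch of that division of labor, with the model owning `backward()` and the optimizer owning `step()`. The `Param`, `LinearModel`, and `SGD` classes are toy placeholders (a real `model.backward()` would traverse the network layer by layer):

```python
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class LinearModel:
    """Toy one-weight model: prediction = w * x."""
    def __init__(self, w):
        self.w = Param(w)

    def parameters(self):
        return [self.w]

    def forward(self, x):
        return self.w.value * x

    def backward(self, x, y):
        # Architecture-specific: gradient of (w*x - y)^2 w.r.t. w.
        self.w.grad += 2 * (self.forward(x) - y) * x

class SGD:
    def __init__(self, parameters, lr):
        self.parameters = list(parameters)
        self.lr = lr

    def zero_grad(self):
        for p in self.parameters:
            p.grad = 0.0

    def step(self):
        for p in self.parameters:
            p.value -= self.lr * p.grad

model = LinearModel(w=0.0)
optimizer = SGD(model.parameters(), lr=0.1)
for _ in range(50):
    optimizer.zero_grad()
    model.backward(x=1.0, y=3.0)  # fit the single point (x=1, y=3)
    optimizer.step()

print(round(model.w.value, 3))  # converges toward 3.0
```

The training loop only ever calls `zero_grad` / `backward` / `step`, so swapping SGD for another optimizer later shouldn't touch the model code at all.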
We should do a full run with the optimizer + activation + framework; otherwise I'm running a bit blind if I try to code up all the optimizers first.
Currently most of the code lives in Jupyter notebooks; I should move most of it into `.py` scripts so that I can reuse the code.