Closed jmarshrossney closed 3 years ago
No rush whatsoever!
OK, the CI is failing because of an authorization error from the Anaconda API, not because our tests are failing:
Using Anaconda API: https://api.anaconda.org
Error: ('Authorization header was not given', 401)
Error: Process completed with exit code 1.
I'm probably being very dumb but couldn't figure out how to fix this.
Fixed a bug in loading from checkpoints. Details are here. Basically, we were previously instantiating the scheduler after loading the optimizer state dict. Instantiating the scheduler modifies (resets) the learning rate in the optimizer, and this is not undone when we subsequently load the scheduler's state dict. The reason is that step() is called during instantiation.
I didn't catch this until just now because we have been using CosineAnnealingWarmRestarts as our scheduler (for all results in 2105.12481v1, thankfully), which is not affected by this in practice due to the way learning rates are calculated. However, for recent experiments I started using CosineAnnealingLR instead, which calculates the new learning rate from the current learning rate, and by that point the current learning rate has already been modified (again, see the issue above for details). The result is that restarting from a checkpoint was erroneously resetting the learning rate to its max value.
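For reference, here is my understanding of the two update rules as implemented in PyTorch, ignoring the special cases at period boundaries. The symbols (η for the learning rate, t for the scheduler's epoch counter, T_cur and T_i for the position within and length of the current restart period) follow the PyTorch docs rather than anything in our code, so treat this as a sketch rather than a precise statement of the implementation.

```latex
% CosineAnnealingWarmRestarts: closed form, computed from the stored base
% learning rate \eta_{\max}, so the optimizer's current lr never enters.
\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})
         \left(1 + \cos\!\left(\frac{T_{\mathrm{cur}}}{T_i}\,\pi\right)\right)

% CosineAnnealingLR: recursive form, computed from the optimizer's
% current lr \eta_{t-1}, so a wrongly reset lr propagates to later steps.
\eta_t = \eta_{\min} + (\eta_{t-1} - \eta_{\min})\,
         \frac{1 + \cos\!\left(\frac{t}{T_{\max}}\,\pi\right)}
              {1 + \cos\!\left(\frac{t-1}{T_{\max}}\,\pi\right)}
```

In the first rule a corrupted value of the optimizer's learning rate is simply overwritten at the next step. In the second it is carried forward.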
This is easily fixed by instantiating both the optimizer and the scheduler first, and only then loading their state dicts.
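A minimal sketch of the corrected loading order, using a dummy parameter in place of the real model. The helper names (build, advance, restore), the checkpoint dictionary keys and the values of MAX_LR and T_MAX are made up for illustration and are not taken from our code. It checks that a run resumed from a checkpoint ends up with the same learning rate as an uninterrupted run.

```python
import copy
import math

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

MAX_LR, T_MAX = 0.1, 20  # illustrative values only


def build():
    """Create a fresh (optimizer, scheduler) pair for a dummy parameter."""
    param = torch.nn.Parameter(torch.zeros(1))
    param.grad = torch.zeros_like(param)  # zero gradient: step() changes nothing
    optimizer = torch.optim.SGD([param], lr=MAX_LR)
    scheduler = CosineAnnealingLR(optimizer, T_max=T_MAX)
    return optimizer, scheduler


def advance(optimizer, scheduler, n_steps):
    """Take n_steps optimizer/scheduler steps and return the resulting lr."""
    for _ in range(n_steps):
        optimizer.step()
        scheduler.step()
    return optimizer.param_groups[0]["lr"]


def restore(checkpoint):
    """Instantiate both objects first, then load both state dicts."""
    optimizer, scheduler = build()
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    return optimizer, scheduler


# Reference: 10 uninterrupted steps.
lr_reference = advance(*build(), 10)

# Interrupted run: 5 steps, checkpoint, restore into fresh objects, 5 more steps.
optimizer, scheduler = build()
lr_at_checkpoint = advance(optimizer, scheduler, 5)
checkpoint = {
    "optimizer": copy.deepcopy(optimizer.state_dict()),
    "scheduler": copy.deepcopy(scheduler.state_dict()),
}

optimizer, scheduler = restore(checkpoint)
lr_after_restore = optimizer.param_groups[0]["lr"]
lr_resumed = advance(optimizer, scheduler, 5)
```

Comparing lr_after_restore with lr_at_checkpoint, and lr_resumed with lr_reference, is the kind of check that would have caught the original bug.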
But what a lesson in why one should monitor metrics and hyperparameters by default!
Based on comments on 2105.12481v1, I've made some changes.
Chiefly:
Geometry2D class
LegacyAffineLayer
LegacyEquivariantSplineLayer
which does the same thing and can still be used.
I have also added a couple more tests and example runcards. I had intended to add more plotting functionality, but I don't think this is necessary at this point.