SamTov closed this 1 year ago.
Very nice and clean implementation of your optimizer.
I have one bigger comment about how it rescales. The rest are just minor things and suggestions.
I responded to that one in some detail, but in general this is an open research question. All learning rates ignore batch size and this one does the same; we only notice it here because we now care about the batch in the training dynamics. However, I can add an argument that allows scaling by the batch size (or anything else) so that we can experiment with this very easily.
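To make the suggestion concrete, here is a minimal sketch of what such an argument could look like. This is not the code in this PR: the function name `trace_normalized_lr`, the `scale_by_batch` flag, and the exact scaling rule are all illustrative assumptions.

```python
import numpy as np


def trace_normalized_lr(base_lr: float, ntk: np.ndarray,
                        scale_by_batch: bool = False,
                        batch_size: int = 1) -> float:
    """Rescale a base learning rate by the trace of the NTK.

    `scale_by_batch` is the hypothetical argument discussed above: when
    enabled, the rescaling also accounts for the batch size so that the
    effective step no longer silently depends on how the batch is chosen.
    """
    lr = base_lr / np.trace(ntk)  # trace normalization of the step size
    if scale_by_batch:
        lr *= batch_size          # optional, experimental batch-size scaling
    return lr
```

With a flag like this, the default behaviour stays identical to the current implementation and the batch-size experiments become a one-line change in the call site.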
I have wired in the trace-normalized optimizer so that it can be called directly and we don't need to keep writing custom train loops. A test for it is included as well.
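For illustration only, a minimal sketch of how a directly callable optimizer plus its test might look; the class name `TraceOptimizer`, its `apply` signature, and the test values are assumptions, not the code added in this PR.

```python
import numpy as np


class TraceOptimizer:
    """Hypothetical stand-in for a directly callable trace-normalized optimizer."""

    def __init__(self, base_lr: float = 1.0):
        self.base_lr = base_lr

    def apply(self, params: np.ndarray, grads: np.ndarray, ntk: np.ndarray) -> np.ndarray:
        # Single update step with the trace-normalized learning rate.
        lr = self.base_lr / np.trace(ntk)
        return params - lr * grads


def test_trace_optimizer_step():
    params = np.ones(3)
    grads = np.full(3, 0.5)
    ntk = np.eye(3) * 2.0  # trace = 6
    updated = TraceOptimizer(base_lr=6.0).apply(params, grads, ntk)
    # lr = 6 / 6 = 1, so each parameter decreases by exactly 0.5.
    np.testing.assert_allclose(updated, np.full(3, 0.5))
```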