wiseodd / natural-gradients

Collection of algorithms for approximating the Fisher Information Matrix for Natural Gradient (and second-order methods in general)
https://wiseodd.github.io
BSD 3-Clause "New" or "Revised" License

possible bug in kfac #2

Open renyiryry opened 5 years ago

renyiryry commented 5 years ago

In the PyTorch implementation of K-FAC, G1_ is computed as:

G1_ = 1/m * a1.grad.t() @ a1.grad

However, a1.grad is not the same quantity as the corresponding gradient in Eq. (1) of the K-FAC paper. Specifically, when you backpropagate through the network to obtain a1.grad, it carries a coefficient of 1/m, where m is the mini-batch size, because the loss is averaged over the batch. In other words, a1.grad = 1/m * (the gradient in the K-FAC paper). Consequently, G1 is wrong (it picks up an extra factor of 1/m^2). Similarly, G2 and G3 are also wrong.
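To make the scaling concrete, here is a minimal sketch (a toy one-layer model with hypothetical names; only `a1` and `G1_` follow the snippet above) comparing the gradients obtained with a mean-reduced and a sum-reduced cross-entropy loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, d_in, d_out = 32, 10, 3                 # m = mini-batch size
W1 = torch.randn(d_in, d_out, requires_grad=True)
X = torch.randn(m, d_in)
y = torch.randint(0, d_out, (m,))

a1 = X @ W1                                # pre-softmax activations
a1.retain_grad()                           # so we can read a1.grad afterwards

# Mean-reduced loss (the default): each row of a1.grad carries a 1/m factor.
loss_mean = F.cross_entropy(a1, y, reduction='mean')
loss_mean.backward(retain_graph=True)
grad_mean = a1.grad.clone()

# Sum-reduced loss: a1.grad holds the unscaled per-sample gradients.
a1.grad = None
loss_sum = F.cross_entropy(a1, y, reduction='sum')
loss_sum.backward()
grad_sum = a1.grad.clone()

print(torch.allclose(grad_sum, m * grad_mean))   # True: mean-loss grads are 1/m smaller

# With the mean-reduced loss, G1_ = 1/m * a1.grad.t() @ a1.grad therefore
# underestimates the Fisher block by a factor of m**2.
G1_mean = 1/m * grad_mean.t() @ grad_mean
G1_sum = 1/m * grad_sum.t() @ grad_sum
print(torch.allclose(G1_sum, (m ** 2) * G1_mean))  # True
```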

Please correct me if I misunderstand something. Thanks!

wiseodd commented 5 years ago

True. The root of this problem is the misspecified loss used in the code. Natural gradient descent requires a negative log-likelihood loss, so we should sum the cross-entropy loss over the mini-batch instead of averaging it.
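A minimal sketch of that fix, assuming the same toy setup as above (the model and variable names are illustrative, not the repository's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m = 32                                    # mini-batch size
W1 = torch.randn(10, 3, requires_grad=True)
X, y = torch.randn(m, 10), torch.randint(0, 3, (m,))

a1 = X @ W1
a1.retain_grad()

# Negative log-likelihood of the whole mini-batch: sum, not average.
loss = F.cross_entropy(a1, y, reduction='sum')
loss.backward()

# a1.grad now holds the unscaled per-sample gradients, so the output-side
# K-FAC factor can be formed as in the snippet from the issue.
G1_ = 1/m * a1.grad.t() @ a1.grad
```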

renyiryry commented 5 years ago

I agree. I would also kindly point out that if the sum is used instead of the average, the mini-batch gradient may need to be rescaled by the mini-batch size when computing the update (or the learning rate adjusted accordingly, as you have already done).
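A minimal sketch of the two equivalent options, assuming plain SGD and a hypothetical `base_lr` (this is not the repository's update code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m = 32
model = torch.nn.Linear(10, 3)
X, y = torch.randn(m, 10), torch.randint(0, 3, (m,))
base_lr = 0.1

# With a sum-reduced loss the parameter gradients are m times larger than
# with the default mean reduction.
loss = F.cross_entropy(model(X), y, reduction='sum')
loss.backward()

# Option 1: rescale the gradients by 1/m before the update ...
for p in model.parameters():
    p.grad.div_(m)
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)

# Option 2 (equivalent): keep the summed gradients and divide the learning
# rate by the mini-batch size instead.
# optimizer = torch.optim.SGD(model.parameters(), lr=base_lr / m)

optimizer.step()
```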