Open renyiryry opened 5 years ago
True. The root of this problem is the misspecified loss used in the code. Natural gradient descent requires a negative log-likelihood loss, so we should sum the cross-entropy loss over the minibatch rather than average it.
I agree. A kind reminder as well: if the sum is used instead of the average when computing the mini-batch gradient, the gradient may need to be rescaled by the mini-batch size (or the learning rate adjusted, as you have already done).
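A quick sketch of the rescaling point (toy NumPy example, not the repository's code): for a linear model with squared error, the gradient of the summed loss is exactly `m` times the gradient of the averaged loss, so dividing the learning rate by `m` recovers the same update.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 3                       # mini-batch size and feature dimension
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = rng.normal(size=d)

residual = X @ w - y              # shape (m,)
grad_sum = 2 * X.T @ residual          # gradient of sum_i (x_i.w - y_i)^2
grad_mean = 2 * X.T @ residual / m     # gradient of (1/m) * sum_i (...)^2

# Summed-loss gradient is m times the averaged-loss gradient:
assert np.allclose(grad_sum, m * grad_mean)

# Identical SGD steps once the learning rate absorbs the factor:
lr = 0.1
assert np.allclose(w - (lr / m) * grad_sum, w - lr * grad_mean)
```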
In the PyTorch implementation of KFAC, `G1_` is computed as:

```python
G1_ = 1/m * a1.grad.t() @ a1.grad
```
However, `a1.grad` is not the same as the a_1 in equation (1) of the KFAC paper. Specifically, because the loss is averaged over the mini-batch, backpropagation through the network produces an `a1.grad` that carries a coefficient of 1/m, where m is the mini-batch size. In other words, `a1.grad` = 1/m * a1 (in the KFAC paper's notation). Consequently, `G1_` is wrong: the 1/m enters the outer product twice, so it is off by a factor of 1/m². Similarly, G2 and G3 are also wrong.
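The 1/m² discrepancy can be checked numerically. Below is a hedged NumPy sketch (hypothetical variable names, not the repository's code): `g` stacks the per-example activation gradients as the paper defines them, while `a1_grad` models what autograd returns under a mean-reduced loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 4
g = rng.normal(size=(m, d))       # per-example activation gradients (paper's quantity)
a1_grad = g / m                   # what .grad holds when the loss is averaged over m

G_paper = (1 / m) * g.T @ g                 # the factor the paper prescribes
G_code = (1 / m) * a1_grad.T @ a1_grad      # what the implementation computes

# The extra 1/m appears twice in the outer product, so the result is
# off by a factor of 1/m**2:
assert np.allclose(G_code, G_paper / m**2)
```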
Please correct me if I misunderstand something. Thanks!