Open orchidmajumder opened 7 years ago
This explanation definitely needs to be clearer. Thanks for pointing it out. I would love for you to take a stab at improving the explanation, and I can always make a pass afterwards and revise the text.
Sure, thanks a lot. I'll do that.
Still interested in taking a swing here, or should I do it?
Apologies for not updating on this. Let me try to do it over the weekend, and if I can't find the time to finish it, you can take over.
Here is a GitHub Gist link with my first attempt at improving it: https://gist.github.com/orchidmajumder/68fc965cb3e38f8b0daa7fec96285b63
Please let me know if the approach looks fine; I'd then raise a PR incorporating any minor feedback we think is relevant.
@zackchase can you please take a look at it? Or if you can suggest someone who can take a look?
@orchidmajumder Hi, I have read your improved version of the page. The section on "Head gradients and the chain rule" is much more elaborate now, but the code example demonstrating the use of head_gradient has remained untouched. I believe there is a confusion in notation between the text describing the chain rule (which is consistent in itself) and the code in block [7]. In particular, the function y(x) in the text is the function z(x) in the code (the internal function y(x) in the code is an unfortunate complication here, in my opinion). Moreover, the head gradient that z.backward() accepts is not dz/dx (that is the internal gradient) but some dg/dz, i.e. the gradient passed back to z from a later stage. Please re-read the last part (starting from "... sometimes when we call the backward ...") and check that it is consistent with the following block [7].
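To make the distinction concrete, here is a small NumPy sketch of what a call like z.backward(head_gradient) computes. The setup (z(x) = 2x², built as y = 2x then z = y·x, and these particular values) is hypothetical, chosen to mirror the kind of example in block [7], not a quote of the tutorial's actual code:

```python
import numpy as np

# Hypothetical setup in the spirit of the tutorial's block [7]:
# z(x) = 2 * x**2, built internally as y = 2*x, then z = y * x.
x = np.array([1.0, 2.0, 3.0, 4.0])

# The internal (local) gradient of z with respect to x: dz/dx = 4*x.
dz_dx = 4 * x

# The "head gradient" is dg/dz for some later-stage function g;
# it is what gets passed into z.backward(...), NOT dz/dx.
head_gradient = np.array([10.0, 1.0, 0.1, 0.01])

# What backward(head_gradient) accumulates into x.grad is the
# chain-rule product dg/dx = dg/dz * dz/dx.
dg_dx = head_gradient * dz_dx
print(dg_dx)  # [40.    8.    1.2   0.16]
```

Without a head gradient the seed defaults to ones, which recovers plain dz/dx; passing head_gradient scales each component by the downstream gradient dg/dz.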
In the documentation section on head gradients and the chain rule, I think it might be better to explain the context behind the head gradient in a bit more detail. For instance, the class notes for CS231n explain backprop in terms of an incoming gradient (the gradient on a node's output) and a local gradient, in the "Intuitive understanding of backpropagation" section. If I understand correctly, the incoming gradient is what is referred to here as the head gradient, and I believe adding that explanation to the documentation would make it more intuitive for readers.
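A minimal sketch of that node-by-node view, using a toy two-node chain g(z(x)) with z = x² and g = sin(z) (both functions are hypothetical, chosen only to illustrate the incoming-times-local decomposition):

```python
import math

# Forward pass through two nodes: z = x**2, then g = sin(z).
x = 3.0
z = x ** 2
g = math.sin(z)

# Backward pass, one node at a time.
incoming_at_g = 1.0                      # dg/dg, the seed at the output
local_g = math.cos(z)                    # local gradient of sin: dg/dz
incoming_at_z = incoming_at_g * local_g  # the "head gradient" arriving at z

local_z = 2 * x                          # local gradient of square: dz/dx
dg_dx = incoming_at_z * local_z          # incoming * local = chain rule

# Cross-check against the analytic derivative d/dx sin(x**2) = 2x*cos(x**2).
print(abs(dg_dx - 2 * x * math.cos(x ** 2)) < 1e-12)  # True
```

At every node the rule is the same: the gradient flowing into the node's input is the incoming (head) gradient multiplied by the node's local gradient.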
Please let me know if my understanding is correct, and I will update the documentation and raise a pull request.