wojzaremba / lstm

questions about g_cloneManyTimes #14

Closed harrywy closed 7 years ago

harrywy commented 9 years ago

Hi,

I have some questions about the function:

g_cloneManyTimes

I understand that this expands the LSTM unit through time. However, is the clone operation really needed?

i.e. can I do something like this: in function setup

model.rnns = core_network

and in function fp

model.err[i], model.s[i] = unpack(model.rnns:forward({x, y, s}))

Best,

wojzaremba commented 9 years ago

It's needed. Layers store intermediate values of the computation during the forward pass, and these values are reused during the backward pass. If you don't clone, every step would override the values from the previous step.

You can check what I am saying with gradient checking.
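
As a toy illustration of the stored-intermediate-values point (a hypothetical two-layer net, not the repo's LSTM):

```lua
-- A module keeps its forward activations in internal buffers, so a
-- second forward call through the same instance overwrites the state
-- the first backward call would need.
require 'nn'

local net = nn.Sequential():add(nn.Linear(4, 3)):add(nn.Tanh())

local x1, x2 = torch.randn(4), torch.randn(4)

local out1 = net:forward(x1):clone() -- activations for "time step 1"
net:forward(x2)                      -- "time step 2" reuses the same buffers

-- This backward now sees the activations of x2, so the gradient it
-- returns for x1 is wrong; gradient checking exposes exactly this.
local wrongGradInput = net:backward(x1, torch.ones(3))
```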

jaseleephd commented 8 years ago

Hi Wojciech, I noticed your g_cloneManyTimes() function makes a deep copy of the network, giving you seq_length copies of the network with no parameter sharing. Don't we want to share the parameters of the LSTM (e.g. i2h, h2h, etc.)?

During the forward pass you're passing intermediate hidden states from one LSTM to the next, but the neighbouring LSTMs have different parameter values, so essentially this is a feedforward network of depth seq_length with a connection from the last layer back to the first. Am I missing something? Would appreciate a clarification.

Cheers

xuewei4d commented 7 years ago

@jasonleeinf No. In g_cloneManyTimes, the cloned parameters are just set as new views of the original parameters; they do not have their own storage.
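
A minimal sketch of that pattern, using a toy nn.Linear instead of the repo's LSTM graph:

```lua
-- After cloning, each cloned parameter tensor is re-pointed at the
-- original's storage via :set(), so all clones read and write the
-- same underlying weights and gradient buffers.
require 'nn'

local proto = nn.Linear(4, 3)
local clone = proto:clone()          -- deep copy: separate storage at first

local params,      gradParams      = proto:parameters()
local cloneParams, cloneGradParams = clone:parameters()

for i = 1, #params do
  cloneParams[i]:set(params[i])          -- clone's weight becomes a view
  cloneGradParams[i]:set(gradParams[i])  -- same for its gradient buffer
end

-- Writing through the original is now visible through the clone:
params[1]:fill(0.5)
print(cloneParams[1][1][1])  -- 0.5
```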

BigNewbiePlus commented 7 years ago

hi @wojzaremba, in the function g_cloneManyTimes this snippet

cloneParams[i]:set(params[i])
cloneGradParams[i]:set(gradParams[i])

is great; as @xuewei4d explained, it just sets a new view, i.e. a reference. But I'm confused by the second line, cloneGradParams[i]:set(gradParams[i]). You said the clones exist to avoid overriding, yet during backpropagation rnns[i] and rnns[i+1] reference the same gradParams, so isn't that overridden? rnns[i+1] would override rnns[i]'s gradParams, and if so the cloning would be meaningless. Am I missing something? Would appreciate a clarification.

wojzaremba commented 7 years ago

rnns[i] and rnns[i+1] reference the same gradParams. However, backpropagation in Torch adds to the gradient values rather than overriding them. Therefore, we end up with a sum over time.
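
You can see the accumulation on a toy module (a sketch, not the repo's code):

```lua
-- Calling :backward twice without zeroing adds into gradWeight
-- instead of overwriting it, which is why the shared gradParams of
-- the clones end up holding the sum of the per-step gradients.
require 'nn'

local lin = nn.Linear(2, 2)
local x, go = torch.randn(2), torch.ones(2)

lin:zeroGradParameters()
lin:forward(x)
lin:backward(x, go)
local once = lin.gradWeight:clone()

lin:forward(x)
lin:backward(x, go)                          -- accumulates on top
print(torch.dist(lin.gradWeight, once * 2))  -- ~0: the two passes summed
```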

BigNewbiePlus commented 7 years ago

@wojzaremba thanks for the quick reply. You mean that during backward the gradParams accumulate rather than being overridden, i.e. gradParams += newGradParams, not gradParams = newGradParams. So you shrink paramdx iff model.norm_dw > params.max_grad_norm, and finally update paramx with paramx:add(paramdx:mul(-params.lr)). Is my understanding right? If so, shouldn't this paramx update be done under the hood by Torch7?
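
Concretely, the step I'm referring to is roughly this (a sketch with toy values standing in for the repo's paramx/paramdx/params; see main.lua for the exact code):

```lua
-- Rescale the flattened gradient if its norm exceeds max_grad_norm,
-- then take a plain SGD step on the flattened parameters.
require 'torch'

local params  = { max_grad_norm = 5, lr = 1 }
local paramx  = torch.randn(10)   -- flattened parameters
local paramdx = torch.randn(10)   -- flattened gradients from the backward pass

local norm_dw = paramdx:norm()
if norm_dw > params.max_grad_norm then
  local shrink_factor = params.max_grad_norm / norm_dw
  paramdx:mul(shrink_factor)           -- clip the gradient norm
end
paramx:add(paramdx:mul(-params.lr))    -- plain SGD step
```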

BigNewbiePlus commented 7 years ago

I have seen the function accGradParameters(input, gradOutput, scale); its documentation says

The module is expected to accumulate the gradients with respect to the parameters in some variable

Yeah, this is as @wojzaremba said: it accumulates. Very grateful to @wojzaremba.

Last question: why use paramx:add(paramdx:mul(-params.lr)) to update paramx? In my opinion, this update should already happen inside model.rnns[i]:backward({x, y, s}, {derr, model.ds})[3]. That is to say, when we build a model (i.e. a compute graph) and call the forward and backward functions, the trained parameters should be updated under the hood, so why does @wojzaremba repeat it? Also, when I removed paramx:add(paramdx:mul(-params.lr)), the perplexity stayed unchanged. Why?

wojzaremba commented 7 years ago

I don't think that :backward updates parameters. It only computes paramdx. There is a way to plug in custom optimizers like Adam or RMSProp to update paramx based on paramdx; however, here we do it manually. The line paramx:add(paramdx:mul(-params.lr)) is just SGD.
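
A self-contained sketch of that distinction, using a toy nn.Linear (not the repo's network) and assuming the optim package is installed:

```lua
-- :backward only fills the gradient buffer; updating the weights is a
-- separate step, done either by hand (as in this repo) or via optim.
require 'nn'
local optim = require 'optim'

local net       = nn.Linear(4, 1)
local criterion = nn.MSECriterion()
local paramx, paramdx = net:getParameters()

local x, y, lr = torch.randn(4), torch.randn(1), 0.1

-- manual SGD, as in the repo's main loop:
paramdx:zero()
criterion:forward(net:forward(x), y)
net:backward(x, criterion:backward(net.output, y))
paramx:add(paramdx:mul(-lr))        -- without this line nothing changes

-- the same step through optim, i.e. the "plug in Adam/RMSProp" route:
local function feval()
  paramdx:zero()
  local loss = criterion:forward(net:forward(x), y)
  net:backward(x, criterion:backward(net.output, y))
  return loss, paramdx
end
optim.sgd(feval, paramx, { learningRate = lr })
```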

BigNewbiePlus commented 7 years ago

OK, I got it. backward only computes paramdx. Maybe I should read the Torch7 docs carefully to better understand the underlying mechanics of the compute graph, especially the forward and backward functions. Thank you very much for your helpful answers!

BigNewbiePlus commented 7 years ago

I read the "A Note on Sharing Parameters" section of the Torch7 docs; it explains how to update paramx when parameters are shared.
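
For anyone reading later, the sharing pattern that note describes is along these lines (my sketch; check the nn docs for the exact caveats around getParameters under sharing):

```lua
-- nn.Module:clone() with field names makes copies whose 'weight',
-- 'bias', 'gradWeight' and 'gradBias' tensors are shared with the
-- prototype, so one update on the prototype's parameters is seen by
-- every clone.
require 'nn'

local proto  = nn.Linear(4, 3)
local clones = {}
for t = 1, 5 do
  clones[t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
end

-- All clones see the same storage as the prototype:
proto.weight:fill(0.25)
print(clones[1].weight[1][1])  -- 0.25
```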