zhuohan123 / g2-lstm

Code for "Towards Binary-Valued Gates for Robust LSTM Training".

What is B? #1

Open felixhao28 opened 6 years ago

felixhao28 commented 6 years ago

I am trying to follow your code but here is where I get lost:

            self.B = input_.data.new(input_.size()).bernoulli_(self.p)
            self.noise = self.U * self.B

What is the purpose of B? To simulate some kind of dropout for the noise? Is it mentioned in the paper somewhere?

Thanks in advance.

source: https://github.com/zhuohan123/g2-lstm/blob/master/language-modeling/g2_lstm.py#L42

wenhuchen commented 6 years ago

I think his code is totally different from the paper.

zhuohan123 commented 6 years ago

It is dropout applied to the Gumbel noise. Please check the README for the details.
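
Roughly, the noise dropout fits into the gate computation like this. This is a simplified sketch, not the exact code from the repo; the helper name, the temperature value, and the logistic-noise form are placeholders of mine:

```python
import torch

def gumbel_sigmoid_gate(preact, temperature=0.5, p=0.5, training=True):
    """Sketch of a Gumbel-style gate: push activations toward {0, 1},
    with the added noise dropped element-wise (the role of `B`)."""
    if not training:
        # At inference time the gate is just a (sharpened) sigmoid.
        return torch.sigmoid(preact / temperature)
    u = torch.rand_like(preact).clamp(1e-6, 1 - 1e-6)
    U = torch.log(u) - torch.log(1 - u)          # logistic noise (difference of two Gumbels)
    B = torch.empty_like(preact).bernoulli_(p)   # keep the noise with probability p
    noise = U * B                                # corresponds to `self.noise = self.U * self.B`
    return torch.sigmoid((preact + noise) / temperature)
```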

felixhao28 commented 6 years ago

Thanks. Somehow I missed that part in the README.

In our experiment, we arbitrarily set p=0.5, but the loss stopped decreasing after a few epochs. Once we removed self.B completely, training continued as normal. In the end, the outputs of the LSTM gates were more skewed towards a Bernoulli distribution (0 and 1) than before, but the end-to-end accuracy was a little lower compared to a plain LSTM. So my conclusion is that G2-LSTM is not a universal drop-in improvement for every task. The idea is very profound, though.
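
For reference, this is roughly how one can check how "binary" the gate activations are; a hypothetical helper of mine, not code from the repo:

```python
import torch

def gate_saturation(gate_values: torch.Tensor, margin: float = 0.1) -> float:
    """Fraction of gate activations within `margin` of 0 or 1."""
    near_zero = gate_values < margin
    near_one = gate_values > 1 - margin
    return (near_zero | near_one).float().mean().item()
```

A higher fraction for the G2-LSTM gates than for the plain LSTM gates is what I mean by "more skewed towards a Bernoulli distribution".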

Mathematically, does it even make sense to apply such dropout to the Gumbel noise? Randomly zeroing the noise for a portion of the population just creates a mixture of two distributions.
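
To make that concrete, here is a tiny illustration of my own (using the logistic-noise form from the sketch above): with the Bernoulli mask, the effective noise B * U is exactly 0 with probability 1 - p and a draw from the original noise distribution with probability p.

```python
import torch

p = 0.5
u = torch.rand(100_000).clamp(1e-6, 1 - 1e-6)
U = torch.log(u) - torch.log(1 - u)          # logistic noise, as in the sketch above
B = torch.empty_like(U).bernoulli_(p)
noise = U * B

print((noise == 0).float().mean())           # ~ 1 - p: the point mass at zero
print(noise[noise != 0].std(), U.std())      # the nonzero part keeps the original spread
```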

And just out of curiosity, have you tried applying the same trick to GRU gates?