felixhao28 opened this issue 6 years ago

I think this code is doing something quite different from what the paper describes. I am trying to follow it, but here is where I get lost:

https://github.com/zhuohan123/g2-lstm/blob/master/language-modeling/g2_lstm.py#L42

What is the purpose of B? Is it meant to simulate some kind of dropout for the noise? Is it mentioned in the paper somewhere? Thanks in advance.
It is dropout applied to the Gumbel noise. Please check the README for the details.
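Roughly, the gate computation has this shape (a simplified sketch; the helper and parameter names here are illustrative, not the exact code from the repo):

```python
import torch

def gumbel_sigmoid_gate(preact, tau=0.9, drop_p=0.5, training=True):
    """Gumbel-sigmoid gate: sigmoid((preact + noise) / tau).

    noise = G1 - G2 for i.i.d. Gumbel(0, 1) samples, which follows a
    logistic distribution. drop_p controls the Bernoulli mask applied
    to the noise (the role self.B appears to play).
    """
    if not training:
        return torch.sigmoid(preact / tau)
    u1 = torch.rand_like(preact).clamp(1e-8, 1 - 1e-8)
    u2 = torch.rand_like(preact).clamp(1e-8, 1 - 1e-8)
    noise = -torch.log(-torch.log(u1)) + torch.log(-torch.log(u2))
    # "Dropout" on the noise: keep it with probability 1 - drop_p,
    # zero it otherwise.
    mask = (torch.rand_like(preact) >= drop_p).float()
    return torch.sigmoid((preact + noise * mask) / tau)
```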
Thanks. Somehow I missed that part in the README.
In our experiment, we arbitrarily set p=0.5, but the loss stopped decreasing after a few epochs. We then removed self.B entirely, and training proceeded normally again. In the end, the outputs of the LSTM gates were more skewed towards a Bernoulli distribution (0 and 1) than before, but the end-to-end accuracy was slightly lower than with a plain LSTM. So my conclusion is that G2-LSTM is not a universal drop-in improvement for every task. The idea is very profound, though.
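In terms of the sketch above, removing self.B amounts to feeding the gates the full, unmasked noise:

```python
import torch

preact = torch.randn(4, 8)  # toy gate pre-activations

# drop_p=0.0 disables the Bernoulli mask, so every unit receives the
# full logistic noise -- equivalent to deleting self.B outright.
gate = gumbel_sigmoid_gate(preact, tau=0.9, drop_p=0.0)
```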
Mathematically, does it even make sense to apply such dropout to the Gumbel noise? Randomly zeroing the noise for part of the population just produces a mixture of two different distributions.
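A quick simulation shows what I mean (standalone numpy sketch; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 0.5

# Logistic noise, i.e. the difference of two i.i.d. Gumbel(0, 1) samples.
u1 = rng.uniform(1e-12, 1.0, n)
u2 = rng.uniform(1e-12, 1.0, n)
noise = -np.log(-np.log(u1)) + np.log(-np.log(u2))

# "Dropout" on the noise: zero it for a random fraction p of the units.
dropped = noise * (rng.random(n) >= p)

# The result is a 50/50 mixture of two distributions: logistic noise for
# half the units and a point mass at exactly zero for the other half.
print((dropped == 0.0).mean())       # ~0.5
print(noise.std(), dropped.std())    # variance roughly halves at p=0.5
```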
And just out of curiosity, have you tried applying the same trick to GRU gates?
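For concreteness, the substitution I have in mind is something like this (a hypothetical sketch; GumbelGateGRUCell and its hyperparameters are my own naming, not from the paper or the repo):

```python
import torch
import torch.nn as nn

class GumbelGateGRUCell(nn.Module):
    """GRU cell whose update/reset gates use a Gumbel-sigmoid
    instead of a plain sigmoid, pushing them toward {0, 1}."""

    def __init__(self, input_size, hidden_size, tau=0.9):
        super().__init__()
        self.lin_x = nn.Linear(input_size, 3 * hidden_size)
        self.lin_h = nn.Linear(hidden_size, 3 * hidden_size)
        self.tau = tau

    def gumbel_sigmoid(self, preact):
        u1 = torch.rand_like(preact).clamp(1e-8, 1 - 1e-8)
        u2 = torch.rand_like(preact).clamp(1e-8, 1 - 1e-8)
        noise = -torch.log(-torch.log(u1)) + torch.log(-torch.log(u2))
        return torch.sigmoid((preact + noise) / self.tau)

    def forward(self, x, h):
        xz, xr, xn = self.lin_x(x).chunk(3, dim=-1)
        hz, hr, hn = self.lin_h(h).chunk(3, dim=-1)
        z = self.gumbel_sigmoid(xz + hz)  # update gate
        r = self.gumbel_sigmoid(xr + hr)  # reset gate
        n = torch.tanh(xn + r * hn)       # candidate state
        return (1 - z) * n + z * h
```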