elmarhaussmann closed this issue 8 years ago
LSTM and GRU have similar performance (e.g., see http://arxiv.org/pdf/1412.3555.pdf). However, a GRU requires fewer matrices and should therefore be more stable under ASGD. So you can try it instead of an LSTM: --hidden-type gru-full.
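To make the "fewer matrices" point concrete: a GRU step needs three input/recurrent matrix pairs (update gate, reset gate, candidate), while an LSTM needs four. Here is a minimal NumPy sketch of a single GRU step; it is purely illustrative, not this repository's actual code, and all names and shapes are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step. Three input/recurrent matrix pairs (update, reset,
    candidate) vs. four for an LSTM: the source of the smaller parameter
    count mentioned above."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde             # interpolate old/new

# Toy usage with made-up dimensions:
n_in, n_hid = 5, 4
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) * 0.1 for s in
          [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 3]
h = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), params)
```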
An LSTM implementation is not much harder, it's just not here ;)
Yes, thanks for the hint! I realized the same in the meantime. E.g., http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf also shows that GRUs are at least as good as, or even better than, LSTMs. They also apply dropout for larger models. I was trying to get close to the state of the art on PTB but couldn't get test perplexity lower than ~110. It seems dropout helps quite a bit there (for the larger models). Any plans on implementing that? ;)
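For reference, the dropout recipe that works well for RNN language models (Zaremba et al., http://arxiv.org/abs/1409.2329) masks only the non-recurrent connections, e.g. the hidden state fed into the softmax, and leaves the recurrent h_{t-1} -> h_t path untouched. A minimal sketch of the inverted-dropout mask, a hypothetical helper rather than anything in this project:

```python
import numpy as np

def dropout(h, p_drop, train, rng):
    """Inverted dropout: at training time, zero each unit with probability
    p_drop and rescale the survivors by 1/(1 - p_drop), so the expected
    activation is unchanged and test time needs no rescaling."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

# Applied only to non-recurrent connections (e.g. hidden state -> softmax):
rng = np.random.default_rng(0)
h = np.ones(8)
h_train = dropout(h, 0.5, train=True, rng=rng)   # ~half zeroed, rest doubled
h_test = dropout(h, 0.5, train=False, rng=rng)   # identity at test time
```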
Thanks for the great software! I don't think there is anything comparable that can handle such huge vocabularies and amounts of data!
LSTMs have shown superior performance for language models in many recent papers. Would it make sense to add an LSTM as an available hidden unit, or is there a particular reason it isn't implemented? Would implementing it run into any significant hurdles, or require significant effort, in the current code?
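For comparison with the GRU sketch above, here is the corresponding single LSTM step with its four matrix pairs; again a purely illustrative sketch with made-up names, not a proposal for how this codebase would implement it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: four input/recurrent matrix pairs (input, forget,
    and output gates plus the cell candidate), one more than the GRU."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wg, Ug, bg = params
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)   # input gate
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)   # forget gate
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)   # output gate
    g = np.tanh(Wg @ x + Ug @ h_prev + bg)   # cell candidate
    c = f * c_prev + i * g                   # new cell state
    h = o * np.tanh(c)                       # new hidden state
    return h, c
```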