quanpn90 closed this issue 8 years ago
It's mostly to make the implementation easier, and I found it to work surprisingly well.
I also tried fbnn but couldn't get it to work. I am not 100% sure why, but I think there is an issue with precision: https://groups.google.com/forum/#!searchin/torch7/HSM/torch7/Hq_KL4k69dM/D3lf0r1OAQAJ
Hi,
Thanks for the great model, and happy new year.
I would like to ask about your hierarchical softmax. Is it your intention to distribute the words equally across the clusters, or is it just to make the implementation easier? I find it hard to understand how you assign words to clusters — did you use a normal distribution? I tried to group words based on their unigram frequencies (as in Mikolov's model), but the result was very bad.
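For reference, here is a minimal sketch (not code from this repository) contrasting the two assignment strategies being discussed: equal-size clusters, which simply split the vocabulary into chunks of the same word count, and a Mikolov-style split that sorts words by unigram frequency and cuts so that each cluster carries roughly the same total probability mass. Function names and the toy vocabulary are illustrative assumptions.

```python
# Hypothetical sketch of two word-to-cluster assignment schemes for
# hierarchical softmax; not taken from this repository.

def equal_size_clusters(vocab, n_clusters):
    """Split the vocabulary into clusters of (nearly) equal word count."""
    size = -(-len(vocab) // n_clusters)  # ceiling division
    return [vocab[i:i + size] for i in range(0, len(vocab), size)]

def equal_frequency_clusters(vocab, counts, n_clusters):
    """Mikolov-style split: sort words by frequency, then cut so each
    cluster holds roughly 1/n_clusters of the total unigram mass."""
    order = sorted(vocab, key=lambda w: -counts[w])
    total = sum(counts[w] for w in vocab)
    target = total / n_clusters
    clusters, current, mass = [], [], 0.0
    for w in order:
        current.append(w)
        mass += counts[w]
        # Close the current cluster once it has absorbed its share of mass.
        if mass >= target and len(clusters) < n_clusters - 1:
            clusters.append(current)
            current, mass = [], 0.0
    clusters.append(current)
    return clusters

# Toy example: one very frequent word dominates the frequency-based split.
vocab = ["the", "a", "cat", "dog", "runs", "fast"]
counts = {"the": 50, "a": 30, "cat": 10, "dog": 5, "runs": 3, "fast": 2}
print(equal_size_clusters(vocab, 2))
print(equal_frequency_clusters(vocab, counts, 2))
```

Note how the frequency-based split can produce very unbalanced cluster sizes (one cluster may contain a single high-frequency word), which may interact badly with an implementation that assumes clusters of similar size.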
Also, I guess you have tried the fbnn HSM. I tried to apply it on top of the network (after the final dropout), but it gives a very large loss. Would it be possible to improve your HSM to work better with unbalanced clusters (where some clusters contain only a few words, while others contain many)?
Thank you,