zhiguowang / BiMPM

BiMPM: Bilateral Multi-Perspective Matching for Natural Language Sentences

How is wordvec.txt different from glove.840B.300d.txt? #45

Closed: andra-pumnea closed this issue 6 years ago

andra-pumnea commented 6 years ago

The paper mentions that GloVe embeddings were used for the word representation layer. However, when I tried to train with glove.840B.300d.txt I got the following error:

Cannot create a tensor proto whose content is larger than 2GB.

What preprocessing is applied to obtain wordvec.txt?

karttikeya commented 6 years ago

TensorFlow doesn't allow assigning an array larger than 2 GB to a Variable, because the initial value gets embedded in the graph as a tensor proto. There are two ways you can work around this:

  1. Typically, only a fraction of the words in the GloVe/word2vec vocabulary actually appears in your corpus. You can extract the embeddings for the words present in your corpus beforehand and point the embedding path in the .config file to that smaller file (see the first sketch after this list). For the format of the .txt file containing the embeddings, you can look here: https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view

  2. You can work with the entire embedding file by slightly changing the assignment operation in the code (see the second sketch after this list). A good answer on how to do this is provided here: https://stackoverflow.com/questions/35394103/initializing-tensorflow-variable-with-an-array-larger-than-2gb
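
A minimal sketch of option 1, assuming a simple whitespace-tokenized corpus; the file names, the `build_vocab` helper, and the tokenization are illustrative, not part of the BiMPM code:

```python
# Filter the full GloVe file down to the words that actually occur in the corpus,
# so the resulting embedding file stays well under the 2 GB limit.

def build_vocab(corpus_paths):
    """Collect the set of whitespace-separated tokens from the corpus files."""
    vocab = set()
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.strip().split())
    return vocab

def filter_embeddings(glove_path, out_path, vocab):
    """Copy only the GloVe lines whose first token (the word) is in the vocabulary."""
    kept = 0
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.split(" ", 1)[0]
            if word in vocab:
                dst.write(line)
                kept += 1
    return kept

# Hypothetical usage:
# vocab = build_vocab(["train.tsv", "dev.tsv", "test.tsv"])
# filter_embeddings("glove.840B.300d.txt", "wordvec.txt", vocab)
```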
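
And a minimal sketch of option 2, the placeholder/assign trick described in the Stack Overflow answer (written for the TF 1.x API; the shapes and variable names are illustrative, not taken from the BiMPM code):

```python
import numpy as np
import tensorflow as tf

# Stand-in for the loaded GloVe matrix; glove.840B.300d is roughly 2.2M x 300.
vocab_size, dim = 100000, 300
pretrained = np.zeros((vocab_size, dim), dtype=np.float32)

# Feeding the array through a placeholder keeps it out of the graph definition,
# which is where the 2 GB tensor-proto limit applies.
embedding_var = tf.get_variable("word_embedding", shape=[vocab_size, dim],
                                trainable=False)
embedding_ph = tf.placeholder(tf.float32, shape=[vocab_size, dim])
embedding_init = embedding_var.assign(embedding_ph)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(embedding_init, feed_dict={embedding_ph: pretrained})
```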

andra-pumnea commented 6 years ago

Thank you!