Open · nick-magnini opened this issue 8 years ago
qvec was designed to load the whole embedding file into memory, because that makes it easier to calculate column-wise correlations. To use this implementation as-is, you need a machine with enough RAM to hold the whole dataset.
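For intuition, here is a rough sketch of why column-wise correlations want the whole matrix in RAM: each embedding dimension is a column, and correlating it with an oracle column touches every row at once. This is a simplified illustration, not qvec's actual scoring code; the names `emb` and `oracle` are assumptions.

```python
# Simplified sketch of "sum of correlations" scoring (illustrative only):
# correlate each oracle column with every embedding column and sum the
# best match.  Column access is why the full matrix sits in memory.
import numpy as np

def sum_of_correlations(emb, oracle):
    """emb: (n_words, d) array; oracle: (n_words, k) array;
    rows aligned on the same vocabulary."""
    score = 0.0
    for j in range(oracle.shape[1]):
        corrs = [abs(np.corrcoef(emb[:, i], oracle[:, j])[0, 1])
                 for i in range(emb.shape[1])]
        score += max(corrs)  # best-aligned embedding dimension
    return score
```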
I am now working on an improved version of qvec that uses the CCA algorithm instead of the sum of correlations. See qvec_cca.py; this implementation still loads everything into memory, but it does not have to, and it can be modified to process data on the fly. However, it requires Matlab to be installed to perform the actual CCA calculation. Please see if it works better for you.
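If installing Matlab is a blocker, scikit-learn ships a CCA implementation that could stand in for the Matlab call. This is an assumption on my part, not what qvec_cca.py does today:

```python
# Sketch: CCA via scikit-learn instead of Matlab (an assumed substitute,
# not the qvec_cca.py code path).
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_correlation(X, Y):
    """Correlation between the first pair of canonical variates of
    X (n_words, d) and Y (n_words, k)."""
    cca = CCA(n_components=1)
    Xc, Yc = cca.fit_transform(X, Y)
    return np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1]
```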
The memory is actually enough:

$> free -g
             total   used   free   shared   buffers   cached
Mem:            93      42     51        0         0        6
The machine has 51 GB of free memory, so it shouldn't be a memory issue. I suspected that, which is why I ran it on a big machine.
Sorry, I didn't notice you wrote in the first message that you have 51G free. However, I still think this is a memory issue, because the "Killed" message comes not from qvec but from your OS. Even though you tried bigger embeddings, that does not necessarily imply that a bigger file requires more memory: the data is stored in a Python dictionary, so if the bigger file has repeated lines or extra spaces, it might still need less memory in the dictionary. I suggest you open a tmux session, run qvec in one pane, and monitor memory usage in another with htop.
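As an alternative to htop, memory use can also be logged from inside the Python process itself. The snippet below assumes the third-party psutil package, which qvec does not use:

```python
# Sketch: print the resident set size (RSS) of the current process at
# checkpoints, e.g. before and after loading the VSM file.
import psutil

def log_rss(note=""):
    rss_gb = psutil.Process().memory_info().rss / 1024.0 ** 3
    print("RSS: %.2f GB %s" % (rss_gb, note))
```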
Well, it's still surprising. Memory use should depend on the number of rows and the number of columns, everything else being equal. An embedding file with 10008676 unique words and 100 dimensions should take much more memory than a file with the same 10008676 unique words and only 15 dimensions each. Isn't that true?
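A back-of-envelope check of that intuition, counting raw 8-byte floats only (a Python dict of lists adds large per-object overhead on top of this lower bound):

```python
# Rough lower bound on vector storage, assuming 8-byte floats.
n_words = 10008676

for dims in (15, 100):
    gb = n_words * dims * 8 / 1024.0 ** 3
    print("%3d dims: ~%.1f GB" % (dims, gb))

# Prints roughly 1.1 GB for 15 dims and 7.5 GB for 100 dims; a dict of
# Python lists of Python floats can need several times more than this.
```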
Running it with gensim resolves the problem, though!
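For reference, a minimal gensim-based loading sketch. The filename is a placeholder, and the loader assumes the standard word2vec text format with a "vocab_size dim" header line; for header-less files, recent gensim versions also accept no_header=True:

```python
# Sketch: load embeddings with gensim, which keeps the vectors in one
# contiguous numpy array -- far more compact than a dict of lists.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)
print(vectors["word1"])  # the numpy vector for "word1"
```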
Great, thanks for the update :)
As a suggestion, it would be great to make your code compatible with gensim, since gensim is widely used.
@nick-magnini Thanks! It is on our gensim student project list https://github.com/RaRe-Technologies/gensim/wiki/Student-Projects#intrinisic-evaluation-of-word2vec-models-with-qvec
Hi,
Thanks for making the code available. I have an embedding model in this format:
word1 4 -2 3 1 1 1 0 -2 2 3 1 0 0 0 -3 -4 0 0 3 -4 1 -5 2 -2 0 -1 -2 0 0 1 0 0 2 2 0 3 -4 -2 0 -5 -1 1 1 2 -2 0 -2 0 -2 -3 -1 -3 0 0 -5 0 5 -2 -1 -2 0 2 0 0 0 2 5 -3 1 2 1 -3 0 1 3 0 -3 0 1 -2 2 -1 -1 0 -4 2 0 -1 0 0 -1 1 0 -5 2 0 0 0 -2 -2
word2 ...
It contains 10008676 lines and is about 2.5 GB in size. I use Python 2.7. My command is:

$> ./qvec-python2.7.py --in_vectors $embedding --in_oracle oracles/semcor_noun_verb.supersenses.en
After it prints "Loading VSM file: ....", it runs for around 10-20 minutes and then stops; the only output after it stops is "Killed". It can't be memory, since I tried bigger embeddings and they went through. What could be the possible reason?