ytsvetko / qvec

Intrinsic evaluation of word vectors
Open Data Commons Open Database License v1.0

Killed process #1

Open nick-magnini opened 8 years ago

nick-magnini commented 8 years ago

Hi,

Thanks for making the code available. I have an embedding model in this format:

word1 4 -2 3 1 1 1 0 -2 2 3 1 0 0 0 -3 -4 0 0 3 -4 1 -5 2 -2 0 -1 -2 0 0 1 0 0 2 2 0 3 -4 -2 0 -5 -1 1 1 2 -2 0 -2 0 -2 -3 -1 -3 0 0 -5 0 5 -2 -1 -2 0 2 0 0 0 2 5 -3 1 2 1 -3 0 1 3 0 -3 0 1 -2 2 -1 -1 0 -4 2 0 -1 0 0 -1 1 0 -5 2 0 0 0 -2 -2 word2 ...

It contains 10,008,676 lines and is about 2.5 GB in size. I am using Python 2.7, and my command is:

$> ./qvec-python2.7.py --in_vectors $embedding --in_oracle oracles/semcor_noun_verb.supersenses.en

After printing "Loading VSM file: ....", it runs for around 10-20 minutes and then stops. The only output after it stops is "Killed". It can't be memory, since I tried bigger embeddings and they went through. What could be the possible reason?
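(For reference, a quick way to sanity-check a file in this format is the sketch below; the filename is a placeholder, not anything from qvec:)

```python
# Sanity-check an embedding file in the "word v1 v2 ... vN" format
# described above. "embeddings.txt" is a hypothetical path.
with open("embeddings.txt") as f:
    first_line = f.readline().split()
    print("dimensions: %d" % (len(first_line) - 1))  # first token is the word
    print("total lines: %d" % (1 + sum(1 for _ in f)))
```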

ytsvetko commented 8 years ago

qvec was designed to load the whole embedding file into memory, because that makes it easier to calculate column-wise correlations. If you want to use this implementation as-is, you will need a machine with enough RAM to hold the whole dataset.
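(For intuition, loading a whole VSM file at once might look roughly like the sketch below; this is an illustration, not qvec's actual code:)

```python
import numpy as np

# Illustrative only: read every row into memory at once, so that
# columns (dimensions) can later be correlated column-wise.
def load_vsm(path):
    words, rows = [], []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            words.append(tokens[0])
            rows.append([float(x) for x in tokens[1:]])
    return words, np.array(rows)  # the entire matrix lives in RAM
```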

I am now working on an improved version of qvec that uses a CCA algorithm instead of the sum of correlations. See qvec_cca.py: this implementation still loads everything into memory, but it does not have to, and it can be modified to process the data on the fly. However, it requires Matlab to perform the actual CCA calculation. Please see if it works better for you.
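(A streaming variant of that idea would accumulate per-column statistics line by line instead of materializing the matrix; a rough sketch, not qvec_cca.py's code:)

```python
import numpy as np

# Illustrative only: compute per-column means in one streaming pass,
# never holding more than one row of the embedding file in memory.
def streaming_column_means(path):
    total, count = None, 0
    with open(path) as f:
        for line in f:
            row = np.array([float(x) for x in line.split()[1:]])
            total = row if total is None else total + row
            count += 1
    return total / count
```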

nick-magnini commented 8 years ago

The memory is actually enough:

$> free -g
             total   used   free   shared   buffers   cached
Mem:            93     42     51        0         0        6

The machine has 51 GB of free memory, so it shouldn't be a memory issue. I suspected that, and that's why I ran it on a big machine.

ytsvetko commented 8 years ago

Sorry, I didn't notice you wrote earlier that you have 51 GB free. However, I still think this is a memory issue, because the "Killed" message comes not from qvec but from your OS. Even though you tried bigger embeddings, that does not necessarily imply the bigger file requires more memory: the data is stored in a Python dictionary, so if the bigger file has repeated lines (or just extra spaces), it may still need less memory once loaded. I suggest running qvec in one tmux pane while monitoring memory usage with htop in another.
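(The deduplication point is easy to demonstrate: a Python dict keeps one entry per key, so repeated words cost no extra memory. A toy example, not tied to any particular file:)

```python
# Repeated keys overwrite each other, so a longer file with duplicate
# words can load into a smaller dict than a shorter file of unique words.
vectors = {}
for line in ["cat 1 2 3", "dog 4 5 6", "cat 7 8 9"]:
    tokens = line.split()
    vectors[tokens[0]] = [float(x) for x in tokens[1:]]
print(len(vectors))  # 2, not 3: the duplicate "cat" row was overwritten
```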

nick-magnini commented 8 years ago

Well, it's still surprising. Memory usage should depend only on the number of rows and the number of columns, everything else being equal. An embedding file with 10,008,676 unique words and 100 dimensions should take much more memory than a file with the same 10,008,676 unique words and only 15 dimensions each. Isn't that true?
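(A rough back-of-envelope check of that claim, counting raw float storage only and ignoring Python object and dict overhead, which can multiply these numbers several times:)

```python
# Raw payload estimate: rows * dims * 8 bytes per double-precision float.
rows = 10008676
for dims in (15, 100):
    gib = rows * dims * 8.0 / 2**30
    print("%d dims: ~%.1f GiB of raw floats" % (dims, gib))
# 15 dims: ~1.1 GiB; 100 dims: ~7.5 GiB, before any dict overhead
```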

nick-magnini commented 8 years ago

Running it with gensim resolves the problem, though!
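(For anyone landing here, loading plain-text vectors with gensim looks roughly like this; the exact comment doesn't say which gensim call was used, and the no_header flag only exists in newer gensim releases:)

```python
from gensim.models import KeyedVectors

# "embeddings.txt" is a placeholder path. binary=False reads plain text.
# no_header=True (gensim >= 4.0) handles files that lack the
# "vocab_size dims" header line the word2vec text format normally has.
vectors = KeyedVectors.load_word2vec_format(
    "embeddings.txt", binary=False, no_header=True)
print(vectors.vector_size)
```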

ytsvetko commented 8 years ago

Great, thanks for the update :)


nick-magnini commented 8 years ago

As a suggestion, it would be great to make your code compatible with gensim, since gensim is widely used.

tmylk commented 8 years ago

@nick-magnini Thanks! It is on our gensim student project list: https://github.com/RaRe-Technologies/gensim/wiki/Student-Projects#intrinisic-evaluation-of-word2vec-models-with-qvec