williamleif / socialsent

Code and data for inducing domain-specific sentiment lexicons.
Apache License 2.0

Memory issues for network construction (i.e. nearest neighbor computation) #2

Open williamleif opened 8 years ago

williamleif commented 8 years ago

Hi Will,

Back to you with some memory issues. In my experience so far, SocialSent runs into memory problems once you reach a threshold of roughly 7000 words to score. So I ran it on a distributed architecture (SHARCNET) with 38000 words to score and asked for 16G of memory, yet it very quickly ran out of memory again:

...
Using Theano backend.
/opt/sharcnet/python/2.7.8/intel/lib/python2.7/site-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
Evaluating SentProp with 100 dimensional GloVe embeddings
Evaluating binary and continuous classification performance
LEXICON SEEDS EMBEDDINGS EVAL_WORDS
Traceback (most recent call last):
  File "concreteness.py", line 95, in <module>
    sym=True, arccos=True)
  File "/home/genereum/socialsent-master/polarity_induction_methods.py", line 99, in random_walk
    M = transition_matrix(embeddings, **kwargs)
  File "/home/genereum/socialsent-master/graph_construction.py", line 62, in transition_matrix
    return Dinv.dot(L).dot(Dinv)
MemoryError
--- SharcNET Job Epilogue ---
job id: 12138822
exit status: 1
cpu time: 313s / 12.0h (0 %)
elapsed time: 479s / 12.0h (1 %)
virtual memory: 11.9G / 16.0G (74 %)

Job returned with status 1.
WARNING: Job only used 1 % of its requested walltime.
WARNING: Job only used 0 % of its requested cpu time.
WARNING: Job only used 65 % of allocated cpu time.
WARNING: Job only used 74% of its requested memory.
...

One solution would be to run it 7000 words at a time. But maybe you know a way to increase the memory the program can use?

Thanks, Michel

williamleif commented 8 years ago

NumPy can't natively handle or distribute the large matrix computations that are needed here. I think the solution is to write some Cython/C code to handle the Dinv.dot(L).dot(Dinv) computation.
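As a possible stopgap before writing any Cython/C, here is a minimal sketch of the same product done in place. It rests on assumptions not confirmed in this thread: that `L` is a dense, square, floating-point NumPy array, and that `Dinv` is a diagonal matrix holding the inverse square roots of the row sums of `L`. The names `normalize_symmetric_inplace` and `dense_gigabytes` are hypothetical helpers, not part of SocialSent. If those assumptions hold, multiplying by a diagonal matrix on the left and right is just row and column scaling, so broadcasting can overwrite `L` without allocating a dense `Dinv` or the intermediate results that each `.dot` call returns. For scale, a dense 38000 x 38000 float64 matrix is about 11.6 GB on its own, which may be why a 16G job fails as soon as a second matrix of that size is requested.

```python
import numpy as np


def dense_gigabytes(n, itemsize=8):
    """Size in GB (10**9 bytes) of a dense n x n array with the given item size."""
    return n * n * itemsize / 1e9


def normalize_symmetric_inplace(L):
    """Overwrite L with Dinv.dot(L).dot(Dinv) without building a dense Dinv.

    Assumption (not confirmed by the source): Dinv is diagonal with entries
    1 / sqrt(row sum of L), and rows that sum to zero get a scale of 0.
    L must be a floating-point array, since it is modified in place.
    """
    degrees = L.sum(axis=1)
    inv_sqrt = np.zeros_like(degrees)
    positive = degrees > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(degrees[positive])
    # Left-multiplying by a diagonal matrix scales row i by inv_sqrt[i].
    L *= inv_sqrt[:, np.newaxis]
    # Right-multiplying by a diagonal matrix scales column j by inv_sqrt[j].
    L *= inv_sqrt[np.newaxis, :]
    return L


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    A = rng.rand(4, 4)
    A = (A + A.T) / 2  # small symmetric stand-in for a similarity matrix
    Dinv = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    reference = Dinv.dot(A).dot(Dinv)
    print(np.allclose(normalize_symmetric_inplace(A.copy()), reference))
    print(dense_gigabytes(38000))
```

This only removes the extra copies made by the normalization step. The similarity matrix itself still has to fit in memory, so for much larger vocabularies a sparse nearest-neighbor graph or the chunked runs suggested above would still be needed.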