ziyin-dl / word-embedding-dimensionality-selection

On the Dimensionality of Word Embedding
https://nips.cc/Conferences/2018/Schedule?showEvent=12567
MIT License
329 stars 44 forks

does it use too much memory? #11

Open ahhygx opened 5 years ago

ahhygx commented 5 years ago

```
cat config/word2vec_sample_config.yml
skip_window: 5
neg_samples: 1
vocabulary_size: 100000
min_count: 100
```

ziyin-dl commented 5 years ago

Can you provide more detail? E.g. parameters in the config file, how much memory was consumed, etc.

ahhygx commented 5 years ago

I use vocabulary_size: 100000 and my computer has 16 GB of RAM. Actually, the file contains about 500000 distinct vocabulary words.

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/workspace/guoxiang/lls/src/github.com/word-embedding-dimensionality-selection/main.py", line 31, in <module>
    min_count=cfg.get('min_count'))
  File "utils/tokenizer.py", line 62, in do_index_data
    self.tokenized = self.tokenize(data)
  File "utils/tokenizer.py", line 23, in tokenize
    tokenized = pool.map(_lower, splitted)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
MemoryError
```

ahhygx commented 5 years ago

https://github.com/ziyin-dl/word-embedding-dimensionality-selection/blob/master/utils/tokenizer.py#L62
Can we pipeline the data? Building the full list costs an extreme amount of memory.
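Something like this streaming tokenizer is what I mean (a rough sketch with hypothetical names and chunking logic, not the repo's code):

```python
# A hypothetical streaming tokenizer: read the corpus in chunks and yield
# lower-cased tokens one at a time instead of materializing one giant list.
def stream_tokens(path, chunk_size=1 << 20):
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            parts = chunk.split()
            # Hold back a possibly incomplete trailing token for the next chunk.
            leftover = parts.pop() if not chunk[-1].isspace() else ''
            for token in parts:
                yield token.lower()
    if leftover:
        yield leftover.lower()

# Usage: count frequencies without ever holding the full token list in memory.
# from collections import Counter
# counts = Counter(stream_tokens('corpus.txt'))
```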

ziyin-dl commented 5 years ago

A vocabulary of 100000 is a bit too large for now; it would require distributed algorithms. The bottleneck is memory and CPU, since the program builds a matrix of size n by n (where n is the vocabulary size).
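For a rough sense of scale (a back-of-the-envelope sketch, not a measurement from the code):

```python
# Rough memory estimate for a dense n-by-n matrix of 64-bit floats.
def dense_matrix_gb(n, bytes_per_entry=8):
    return n * n * bytes_per_entry / 1e9

print(dense_matrix_gb(10000))   # ~0.8 GB: workable on a 16 GB machine
print(dense_matrix_gb(100000))  # ~80 GB: far beyond 16 GB of RAM
```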

However, in our observations the vocab size does not matter much for the selected dimensionality, as long as the signal-to-noise characteristics of the matrix are unchanged. A 10k-20k vocabulary should already reflect the right dimensionality.

OYE93 commented 5 years ago

Hi @ziyin-dl, I tried the example on text8. With the default config (vocab_size == 10000) the dimensionality I got was 123. When I changed vocab_size to 15000, the program used all of the available vocabulary (the actual vocab_size was 11815) and the dimensionality changed to 142. My question: if the dimensionality changes with vocab_size on the same corpus, how do I get the most suitable dimension for the corpus? Or is vocab_size = corpus_size the best choice? Thanks!

ziyin-dl commented 5 years ago

Yes, the dimensionality depends on vocab_size and the other parameters in the config file, and this is expected. Consider an extreme case where you set vocab_size=10: only the top 10 most frequent tokens are embedded, and all the information is essentially contained in a 10-by-10 co-occurrence matrix. The dimensionality in that case should clearly differ from the one you get with vocab_size=10000.

In general, vocab_size, min_count, window_size, etc. are additional hyper-parameters of the embedding algorithm. There is no standard way of choosing them; for the vocabulary, the usual approach is to set a cut-off threshold (a.k.a. min_count). If the frequency of a token is less than min_count, it is mapped to a special token '<unk>', indicating that it does not appear often enough for a stable estimate (if a token appears only once in the corpus, how can we learn a meaningful embedding for it?). I would suggest min_count=50 or 100. On the text8 corpus, min_count=100 gives a vocabulary size of roughly 10000.
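A minimal sketch of this cut-off rule (a simplified illustration, not the actual utils/tokenizer.py code):

```python
from collections import Counter

UNK = '<unk>'

def build_vocab(tokens, min_count=100):
    """Keep tokens that appear at least min_count times."""
    counts = Counter(tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def map_to_vocab(tokens, vocab):
    """Replace out-of-vocabulary tokens with the special <unk> token."""
    return [tok if tok in vocab else UNK for tok in tokens]

# Example: with min_count=2, 'rare' falls below the threshold and becomes <unk>.
toks = ['the', 'cat', 'the', 'rare', 'cat', 'the']
vocab = build_vocab(toks, min_count=2)
print(map_to_vocab(toks, vocab))  # ['the', 'cat', 'the', '<unk>', 'cat', 'the']
```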

OYE93 commented 5 years ago

OK, so in my understanding the best way is: set min_count, get the number of words appearing at least min_count times, and then use that number as vocab_size? If so, why not set vocab_size automatically from min_count? And when training the embedding with gensim, min_count and window_size should be set to the same values as in the config file.
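For example, something like this (a sketch assuming the gensim 4.x API and the text8 corpus; the vector_size is just a placeholder for whatever dimensionality your program reports):

```python
# Sketch: carrying the config values over to gensim (assumes gensim 4.x).
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

sentences = Text8Corpus('text8')  # same corpus as in the example
model = Word2Vec(
    sentences,
    vector_size=128,   # placeholder for the dimensionality selected by this repo
    window=5,          # matches skip_window in the config
    min_count=100,     # matches min_count in the config
    negative=1,        # matches neg_samples in the config
    sg=1,              # skip-gram
)
model.save('text8_w2v.model')
```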

ziyin-dl commented 5 years ago

I believe in gensim only min_count is needed. In the program, setting vocab_size=corpus_size will make sure the vocabulary is constructed using min_count only.

The purpose of adding a vocab_size parameter is to give people another way to control the vocabulary; it happens quite often (in both research and industrial applications) that people want a vocabulary with a nice-looking number (like 10000, 20000, etc.)

OYE93 commented 5 years ago

OK, I understand the logic now: the real vocabulary size is decided by the stricter of min_count and vocab_size. For example, with min_count=100 and vocab_size=10000 the real size is 10000, while with min_count=100 and vocab_size=15000 the real size is 11815, which is the number of words appearing at least 100 times. After that, when I train the embedding, I should keep the vocabulary the same as the one produced by your program, in other words the 'real size' mentioned above. Then I can get a suitable embedding.

ziyin-dl commented 5 years ago

Right, what the code does is:

1) Construct a vocabulary V1 using min_count.
2) Construct a vocabulary V2 using vocab_size.
3) Return whichever is smaller: V1 < V2 ? V1 : V2.
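In pseudocode (a simplified sketch with hypothetical helper names, not the actual implementation):

```python
from collections import Counter

def select_vocab(tokens, min_count, vocab_size):
    """Return the smaller of the min_count-filtered vocab and the top-vocab_size vocab."""
    counts = Counter(tokens)
    # V1: every token that appears at least min_count times
    v1 = [tok for tok, c in counts.most_common() if c >= min_count]
    # V2: the vocab_size most frequent tokens
    v2 = [tok for tok, _ in counts.most_common(vocab_size)]
    return v1 if len(v1) < len(v2) else v2

# E.g. min_count=100, vocab_size=15000 on text8 gives the 11815-word
# vocabulary mentioned above, because V1 is the smaller of the two.
```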

OYE93 commented 5 years ago

Ok, thanks! 😃

liuhuafeiyu commented 5 years ago

Dear ahhygx: I met the same problem as you. I used the text8 data to run the program, but there were still errors with vocabulary_size=10000:

```
Traceback (most recent call last):
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/li/program/word-embedding-dimensionality-selection-master/main.py", line 37, in <module>
    signal_matrix.estimate_signal()
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/signal_matrix.py", line 39, in estimate_signal
    matrix = self.construct_matrix(self.corpus)
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/word2vec_matrix.py", line 60, in construct_matrix
    PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)
MemoryError
```

I want to know why. The text8 data is not very large, so why is there a memory error? Have you solved the problem yet? Looking forward to hearing from you.
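For context, here is my rough estimate of what that one line allocates (my own back-of-the-envelope sketch, assuming vocabulary_size=10000 and float64 arrays):

```python
# The failing line builds several dense n-by-n float64 temporaries:
# np.outer(Pi, Pi), np.log(Pij), np.log(np.outer(Pi, Pi)), and the differences.
n = 10000  # assumed vocabulary size from the config
bytes_per_matrix = n * n * 8

print(bytes_per_matrix / 1e9, "GB per n-by-n matrix")  # ~0.8 GB
# With 4-5 such temporaries alive at once, peak usage is a few GB,
# which can already fail if most of the 16 GB of RAM is in use elsewhere.
print(5 * bytes_per_matrix / 1e9, "GB rough peak")     # ~4 GB
```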