Open · ahhygx opened this issue 5 years ago
Can you provide more detail? E.g. parameters in the config file, how much memory was consumed, etc.
I use `vocabulary_size: 100000` and my computer has 16 GB of memory. Actually, the file contains 500000 vocabulary words.
```
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/workspace/guoxiang/lls/src/github.com/word-embedding-dimensionality-selection/main.py", line 31, in <module>
    min_count=cfg.get('min_count'))
  File "utils/tokenizer.py", line 62, in do_index_data
    self.tokenized = self.tokenize(data)
  File "utils/tokenizer.py", line 23, in tokenize
    tokenized = pool.map(_lower, splitted)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
MemoryError
```
https://github.com/ziyin-dl/word-embedding-dimensionality-selection/blob/master/utils/tokenizer.py#L62
Can we pipeline the data? The list costs an extreme amount of memory.
A vocab of 100000 is a bit too much for now; it would require distributed algorithms. The bottleneck will be memory and CPU, since there will be a matrix of size n by n (n is the vocabulary size).
However, from our observations, vocab size is not very important to the dimensionality, as long as the signal-to-noise characteristics do not change. Using a 10k-20k vocab should reflect the right dimensionality.
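For a rough sense of scale, here is a back-of-the-envelope sketch (it assumes a single dense n-by-n float64 matrix; real peak usage is higher because the computation also allocates temporaries):

```python
# Rough memory footprint of one dense n-by-n float64 matrix.
def dense_matrix_gib(n, bytes_per_entry=8):
    return n * n * bytes_per_entry / 2**30

for n in (10_000, 20_000, 100_000):
    print("n = {:>7,}: {:6.1f} GiB".format(n, dense_matrix_gib(n)))

# n =  10,000:    0.7 GiB
# n =  20,000:    3.0 GiB
# n = 100,000:   74.5 GiB   (far beyond a 16 GB machine)
```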
Hi @ziyin-dl, I tried the example on text8. With the default config (vocab == 10000) the dimension I got is 123; when I changed the vocab to 15000, the program used all the available vocab (vocab_size == 11815) and the dimension changed to 142. I have a question: if the dimension changes with vocab_size on the same corpus, how do I get the most suitable dimension for the corpus? Or maybe vocab_size = corpus_size is the best choice? Thanks!
Yes, the dimensionality depends on `vocab_size` and the other parameters in the config file. This is expected; consider an extreme case where you set `vocab_size=10`. This means only the top 10 most frequent tokens are embedded, and all the information is essentially contained in a 10 by 10 co-occurrence matrix. The dimensionality in this case should definitely be different from using `vocab_size=10000`.
In general, `vocab_size`, `min_count`, `window_size`, etc. are further hyper-parameters of the embedding algorithm. There is no standard way of choosing them; for choosing the vocabulary, a usual approach is to set a cut-off threshold (a.k.a. `min_count`). If the frequency of a token is less than `min_count`, it is mapped to a special token `'<unk>'`, indicating that it does not appear often enough for a stable estimate (if a token appears only once in the corpus, how can we learn a meaningful embedding for it?). I would suggest using `min_count=50` or `100`. On the text8 corpus, `min_count=100` roughly leads to a vocabulary size of 10000.
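A minimal sketch of this cut-off rule (the helper below is illustrative, not the repo's actual tokenizer API):

```python
from collections import Counter

def apply_min_count(tokens, min_count=100):
    """Replace tokens seen fewer than min_count times with '<unk>'."""
    counts = Counter(tokens)
    return ['<unk>' if counts[tok] < min_count else tok for tok in tokens]

tokens = "the cat sat on the mat the cat".split()
print(apply_min_count(tokens, min_count=2))
# ['the', 'cat', '<unk>', '<unk>', 'the', '<unk>', 'the', 'cat']
```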
OK. So in my understanding, the best way is: set `min_count`, get the number of words that appear more than `min_count` times, and then set that number as `vocab_size`? If so, why not set `vocab_size` automatically according to `min_count`? Also, when training the embedding with gensim, `min_count` and `window_size` should be set to the same values as in the config file.
I believe in gensim only `min_count` is needed. In this program, setting `vocab_size=corpus_size` will make sure the vocabulary is constructed using `min_count` only.
The purpose of adding a `vocab_size` parameter is to give people another way to control the vocabulary; it happens quite often (in both research and industrial applications) that people want a vocabulary with a nice-looking size (like 10000, 20000, etc.).
OK, I understand the logic now: the real vocabulary size is decided by the overlap of `min_count` and `vocab_size`. For example, if I set `min_count` to 100 and `vocab_size` to 10000, the real size is 10000; if I set `min_count` to 100 and `vocab_size` to 15000, the real size is 11815, which is the number of words appearing more than 100 times. Then, when I train the embedding, I should keep the final vocabulary the same size as the one reported by your program, in other words the 'real size' mentioned above. That way I can get a suitable embedding.
Right, what the code does is:
1) construct a vocabulary V1 using `min_count`;
2) construct a vocabulary V2 using `vocab_size`;
3) return whichever is smaller: `V1 < V2 ? V1 : V2` (see the sketch below).
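A small sketch of that rule (illustrative Python, not the repo's exact code; since V1 is sorted by frequency, capping it at `vocab_size` returns the smaller of the two vocabularies):

```python
from collections import Counter

def effective_vocab(tokens, min_count, vocab_size):
    """The vocabulary actually used: capped by BOTH min_count and vocab_size."""
    counts = Counter(tokens)
    # V1: every token appearing at least min_count times, most frequent first.
    v1 = [tok for tok, cnt in counts.most_common() if cnt >= min_count]
    # Truncating V1 to the vocab_size most frequent tokens yields
    # min(|V1|, |V2|) entries, i.e. whichever vocabulary is smaller.
    return v1[:vocab_size]
```

On text8 with `min_count=100`, V1 has 11815 entries, so `vocab_size=15000` leaves it untouched (real size 11815) while `vocab_size=10000` truncates it to 10000, matching the numbers discussed above.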
Ok, thanks! 😃
> I use `vocabulary_size: 100000` and my computer has 16 GB of memory. Actually, the file contains 500000 vocabulary words.

Dear ahhygx: I met the same problem as you. I used the text8 data to run the program, but there were still errors, with vocabulary_size=10000:

```
Traceback (most recent call last):
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/li/program/word-embedding-dimensionality-selection-master/main.py", line 37, in <module>
    signal_matrix.estimate_signal()
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/signal_matrix.py", line 39, in estimate_signal
    matrix = self.construct_matrix(self.corpus)
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/word2vec_matrix.py", line 60, in construct_matrix
    PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)
MemoryError
```

I want to know why. In fact, the text8 data is not very large; why are there memory errors? Have you solved the problem yet? Looking forward to hearing from you.
```
$ cat config/word2vec_sample_config.yml
skip_window: 5
neg_samples: 1
vocabulary_size: 100000
min_count: 100
```
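If that run really used the `vocabulary_size: 100000` from this config (rather than the 10000 mentioned in the text), the MemoryError is expected: the failing line builds several dense n-by-n float64 intermediates. A rough sketch of the sizes involved (assuming n = 100000):

```python
import numpy as np

# Dense intermediates in the failing line:
#   PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)
n = 100_000                               # vocabulary_size from the config
per_matrix_gib = n * n * np.dtype(np.float64).itemsize / 2**30
# Pij, np.outer(Pi, Pi), their elementwise logs, and the subtraction
# results are all n-by-n, so peak usage is several times this figure.
print("one n-by-n matrix: {:.1f} GiB".format(per_matrix_gib))  # ~74.5 GiB
```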