Open · ahhygx opened this issue 5 years ago
Can you provide more detail? E.g. parameters in the config file, how much memory was consumed, etc.
I use `vocabulary_size: 100000` and my computer has 16 GB of memory. Actually, the file contains 500000 vocabulary words.
```
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/workspace/guoxiang/lls/src/github.com/word-embedding-dimensionality-selection/main.py", line 31, in <module>
    min_count=cfg.get('min_count'))
  File "utils/tokenizer.py", line 62, in do_index_data
    self.tokenized = self.tokenize(data)
  File "utils/tokenizer.py", line 23, in tokenize
    tokenized = pool.map(_lower, splitted)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
MemoryError
```
https://github.com/ziyin-dl/word-embedding-dimensionality-selection/blob/master/utils/tokenizer.py#L62
Can we pipeline the data? The list costs an extreme amount of memory.
A vocab of 100000 is a bit too much for now; it would require distributed algorithms. The bottleneck will be memory and CPU, since there will be a matrix of size n by n (n is the vocabulary size).
However, from our observations, vocab size is not very important to the dimensionality, as long as the signal-to-noise characteristics do not change. Using a 10k-20k vocab should reflect the right dimensionality.
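For a rough sense of scale, here is a back-of-the-envelope sketch (it assumes a single dense n-by-n float64 matrix; real peak usage is higher because the computation also allocates temporaries):

```python
# Rough memory footprint of one dense n-by-n float64 matrix.
def dense_matrix_gib(n, bytes_per_entry=8):
    return n * n * bytes_per_entry / 2**30

for n in (10_000, 20_000, 100_000):
    print("n = {:>7,}: {:6.1f} GiB".format(n, dense_matrix_gib(n)))

# n =  10,000:    0.7 GiB
# n =  20,000:    3.0 GiB
# n = 100,000:   74.5 GiB   (far beyond a 16 GB machine)
```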
Hi @ziyin-dl, I tried the example on text8. With the default config (vocab == 10000) the dimension I got is 123; when I changed the vocab to 15000, the program used all the available vocab (vocab_size == 11815) and the dimension changed to 142. I have a question: if the dimension changes with vocab_size on the same corpus, how do I get the most suitable dimension for the corpus? Or maybe vocab_size = corpus_size is the best choice? Thanks!
Yes, the dimensionality depends on `vocab_size` and the other parameters in the config file. This is expected; consider an extreme case where you set `vocab_size=10`. This means only the top 10 most frequent tokens are embedded, and all the information is essentially contained in a 10 by 10 co-occurrence matrix. The dimensionality in this case should definitely be different from using `vocab_size=10000`.
In general, `vocab_size`, `min_count`, `window_size`, etc. are further hyper-parameters of the embedding algorithm. There is no standard way of choosing them; for choosing the vocabulary, a usual approach is to set a cut-off threshold (a.k.a. `min_count`). If the frequency of a token is less than `min_count`, it is mapped to a special token `'<unk>'`, indicating that it does not appear often enough for a stable estimate (if a token appears only once in the corpus, how can we learn a meaningful embedding for it?). I would suggest using `min_count=50` or `100`. On the text8 corpus, `min_count=100` roughly leads to a vocabulary size of 10000.
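A minimal sketch of this cut-off rule (the helper below is illustrative, not the repo's actual tokenizer API):

```python
from collections import Counter

def apply_min_count(tokens, min_count=100):
    """Replace tokens seen fewer than min_count times with '<unk>'."""
    counts = Counter(tokens)
    return ['<unk>' if counts[tok] < min_count else tok for tok in tokens]

tokens = "the cat sat on the mat the cat".split()
print(apply_min_count(tokens, min_count=2))
# ['the', 'cat', '<unk>', '<unk>', 'the', '<unk>', 'the', 'cat']
```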
OK. So in my understanding, the best way is: set `min_count`, get the number of words that appear more than `min_count` times, and then set that number as `vocab_size`? If so, why not set `vocab_size` automatically according to `min_count`? Also, when training the embedding with gensim, `min_count` and `window_size` should be set to the same values as in the config file.
I believe in gensim only `min_count` is needed. In this program, setting `vocab_size=corpus_size` will make sure the vocabulary is constructed using `min_count` only.
The purpose of adding a `vocab_size` parameter is to give people another way to control the vocabulary; it happens quite often (in both research and industrial applications) that people want a vocabulary with a nice-looking size (like 10000, 20000, etc.).
OK, I understand the logic now: the real vocabulary size is decided by the overlap of `min_count` and `vocab_size`. For example, if I set `min_count` to 100 and `vocab_size` to 10000, the real size is 10000; if I set `min_count` to 100 and `vocab_size` to 15000, the real size is 11815, which is the number of words appearing more than 100 times. Then, when I train the embedding, I should keep the final vocabulary the same size as the one reported by your program, in other words the 'real size' mentioned above. That way I can get a suitable embedding.
Right, what the code does is:
1) construct a vocabulary V1 using `min_count`;
2) construct a vocabulary V2 using `vocab_size`;
3) return whichever is smaller: `V1 < V2 ? V1 : V2` (see the sketch below).
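A small sketch of that rule (illustrative Python, not the repo's exact code; since V1 is sorted by frequency, capping it at `vocab_size` returns the smaller of the two vocabularies):

```python
from collections import Counter

def effective_vocab(tokens, min_count, vocab_size):
    """The vocabulary actually used: capped by BOTH min_count and vocab_size."""
    counts = Counter(tokens)
    # V1: every token appearing at least min_count times, most frequent first.
    v1 = [tok for tok, cnt in counts.most_common() if cnt >= min_count]
    # Truncating V1 to the vocab_size most frequent tokens yields
    # min(|V1|, |V2|) entries, i.e. whichever vocabulary is smaller.
    return v1[:vocab_size]
```

On text8 with `min_count=100`, V1 has 11815 entries, so `vocab_size=15000` leaves it untouched (real size 11815) while `vocab_size=10000` truncates it to 10000, matching the numbers discussed above.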
Ok, thanks! 😃
> I use `vocabulary_size: 100000` and my computer has 16 GB of memory. Actually, the file contains 500000 vocabulary words.

Dear ahhygx: I met the same problem as you. I used the text8 data to run the program, but there were still errors, with vocabulary_size=10000:

```
Traceback (most recent call last):
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/li/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/li/program/word-embedding-dimensionality-selection-master/main.py", line 37, in <module>
    signal_matrix.estimate_signal()
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/signal_matrix.py", line 39, in estimate_signal
    matrix = self.construct_matrix(self.corpus)
  File "/home/li/program/word-embedding-dimensionality-selection-master/matrix/word2vec_matrix.py", line 60, in construct_matrix
    PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)
MemoryError
```

I want to know why. In fact, the text8 data is not very large; why are there memory errors? Have you solved the problem yet? Looking forward to hearing from you.
```
$ cat config/word2vec_sample_config.yml
skip_window: 5
neg_samples: 1
vocabulary_size: 100000
min_count: 100
```
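If that run really used the `vocabulary_size: 100000` from this config (rather than the 10000 mentioned in the text), the MemoryError is expected: the failing line builds several dense n-by-n float64 intermediates. A rough sketch of the sizes involved (assuming n = 100000):

```python
import numpy as np

# Dense intermediates in the failing line:
#   PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)
n = 100_000                               # vocabulary_size from the config
per_matrix_gib = n * n * np.dtype(np.float64).itemsize / 2**30
# Pij, np.outer(Pi, Pi), their elementwise logs, and the subtraction
# results are all n-by-n, so peak usage is several times this figure.
print("one n-by-n matrix: {:.1f} GiB".format(per_matrix_gib))  # ~74.5 GiB
```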