weka511 / nlp

My experiments with Natural Language Processing. I've created a few programs to try out concepts.
GNU General Public License v3.0

Word2vec2: build vocabulary appears greedy #39

Open weka511 opened 1 year ago

weka511 commented 1 year ago

The vocabulary build claims to have stored 1,209,358 words from blogs.zip. It takes 206 minutes and occupies 38M of disk.

weka511 commented 1 year ago

I've found some junk in the vocabulary file (among the real words): < 682787 post 745958

730051 26 4060 .... 163880

I have confirmed that there really are 1,209,358 words in the vocabulary by reading it back.
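
One way to see how much of the vocabulary is junk might be to partition the tokens into word-like and non-word-like entries before saving. The following is only a minimal sketch, not the code in this repository: `is_word`, `split_vocabulary`, the regular expression, and the `min_count` parameter are all hypothetical, and the sample tokens and counts are made up for illustration.

```python
import re
from collections import Counter

# Assumed rule: a "real word" is letters only, optionally joined by internal
# apostrophes or hyphens. This is a guess, not the repository's tokenizer.
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def is_word(token):
    """Return True if the whole token looks like an ordinary word."""
    return WORD.fullmatch(token.lower()) is not None


def split_vocabulary(counts, min_count=1):
    """Partition a token -> count mapping into (kept, rejected) dicts.

    A token is kept only if it looks like a word and occurs at least
    min_count times; everything else is rejected.
    """
    kept, rejected = {}, {}
    for token, n in counts.items():
        target = kept if is_word(token) and n >= min_count else rejected
        target[token] = n
    return kept, rejected


if __name__ == "__main__":
    # Illustrative tokens only; the counts are invented.
    sample = Counter(["post", "post", "blog", "<", "26", "...."])
    kept, rejected = split_vocabulary(sample)
    print(f"kept {len(kept)} tokens, rejected {len(rejected)}")
    print("rejected:", sorted(rejected))
```

Raising `min_count` would also drop rare tokens, which is a common way to shrink a word2vec vocabulary, though whether that is appropriate here depends on the corpus.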

weka511 commented 1 year ago

Saved vocabulary of 1209335 words to ./data\blogs.npz
Elapsed Time 200 m 16.08 s