williamleif / histwords

Collection of tools for building diachronic/historical word vectors
http://nlp.stanford.edu/projects/histwords/
Apache License 2.0

SGNS results #7

Open Alaa-Ebshihy opened 6 years ago

Alaa-Ebshihy commented 6 years ago

Hi,

I have a problem re-generating the SGNS embeddings on the Google Ngram corpus.

I followed these steps:

  1. use histwords/googlengram/pullscripts/posgrab.py to generate the 1-gram counts
  2. use histwords/googlengram/pullscripts/downloadandsplit.py then histwords/googlengram/pullscripts/gramgrab.py (with the context size set to 4)
  3. use histwords/googlengram/pullscripts/runmerge.py on the output from 2 and then histwords/googlengram/pullscripts/indexmerge.py
  4. use histwords/googlengram/freqperyear.py on the output of 3
  5. use histwords/googlengram/makedecades.py on the output of 3
  6. use histwords/sgns/makecorpus.py, passing the outputs of 1, 4 and 5
  7. train embeddings using histwords/sgns/runword2vec.py (using --sequential option)
  8. use histwords/sgns/postprocessingsgns.py on the trained data.

My problem is that the generated vectors are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary has about 50,000 words, while yours has 100,000.
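
To narrow down where the two vocabularies diverge, a comparison along the following lines might help. This is only a minimal sketch: it assumes each vocabulary is stored as a pickled list of words, and the helper names `load_vocab` and `compare_vocabs` and the file paths in the usage note are hypothetical, not part of histwords.

```python
import pickle
from pathlib import Path


def load_vocab(path):
    """Load a pickled list of words (assumed storage format, not confirmed)."""
    with Path(path).open("rb") as f:
        # encoding="latin1" lets Python 3 read pickles written by Python 2
        return pickle.load(f, encoding="latin1")


def compare_vocabs(mine, reference):
    """Summarise how two vocabularies (iterables of words) differ."""
    mine_set, ref_set = set(mine), set(reference)
    return {
        "mine_size": len(mine_set),
        "reference_size": len(ref_set),
        "shared": len(mine_set & ref_set),
        # words present in the reference but absent from my vocabulary
        "missing_from_mine": sorted(ref_set - mine_set),
    }


if __name__ == "__main__":
    # Toy data standing in for the real vocabularies
    report = compare_vocabs(["the", "cat"], ["the", "cat", "sat"])
    print(report["mine_size"], report["reference_size"], report["shared"])
    print(report["missing_from_mine"])
```

With real data, one would call something like `compare_vocabs(load_vocab("my_output/1990-vocab.pkl"), load_vocab("pretrained/1990-vocab.pkl"))`, where both paths are placeholders. If the missing words turn out to be mostly rare ones, that would point to a frequency cutoff somewhere in the pipeline rather than a wrong step.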

So my question is: is there anything wrong with the steps I followed? Or can you give me any information on why this happens?

Thanks,