williamleif / histwords

Collection of tools for building diachronic/historical word vectors
http://nlp.stanford.edu/projects/histwords/
Apache License 2.0

Zero-valued vectors? #12

Open nishanthsanjeev opened 3 years ago

nishanthsanjeev commented 3 years ago

Regarding the pre-trained vectors for some of the corpora (on the HistWords website):

For specific decades, a handful of word vectors are "0.0" across all 300 dimensions. Note that these words are still present in the corpus for the decade in question.

However, they do not get any real representation and have been assigned zero in every dimension. For example, the vector for the word 'autism' in the 1800s decade of the Google n-grams eng-all vectors is [0.0 ... 0.0] across all 300 dimensions.

Would treating these words as simply 'missing' from the corpus at this particular decade be apt?

baumanno commented 2 years ago

I was recently confused by this myself, and think I may have found an answer. Appendix A of the paper states (emphasis mine):

For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, [...]

My interpretation is that the vocabularies across all decades contain the same 100,000 words, and that a zero-valued vector indicates that no embedding was found for that word because the word doesn't appear in the corpus of that decade.

That first assumption is quickly confirmed:

import pickle

vocabs = []
for decade in range(1800, 2000, 10):
    with open(f"./sgns/{decade}-vocab.pkl", 'rb') as f:
        # set semantics enable comparisons without having to sort lists manually
        vocabs.append(set(pickle.load(f)))

# union of first set with all others, serves as a point of reference in the comparison below
u = vocabs[0].union(*vocabs[1:])

# evaluates to True only if every decade has exactly the same vocabulary
all(u == v for v in vocabs)

Confirming the second assumption may be a little more involved, but my take-away is that if you're viewing the data synchronically (one decade at a time), it should be safe to drop the zero-valued vectors. If you need a diachronic view, as in the paper, you should not drop anything.
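
For the synchronic case, here is a minimal sketch of how the zero-valued vectors could be separated out for a single decade. The helper name split_zero_vectors and the toy data are my own and not part of histwords. The sketch assumes that the i-th vocabulary entry labels the i-th row of the embedding matrix, which I have not verified against the released files.

```python
def split_zero_vectors(vocab, vectors):
    """Split words by whether their vector is all zeros.

    vocab   -- sequence of words; vocab[i] is assumed to label vectors[i]
    vectors -- sequence of rows (lists, tuples, or a 2-D NumPy array)

    Returns (kept, dropped): a dict of word -> vector for non-zero rows,
    and a list of the words whose vector is zero in every dimension.
    """
    if len(vocab) != len(vectors):
        raise ValueError("vocab and vectors must have the same length")
    kept, dropped = {}, []
    for word, row in zip(vocab, vectors):
        if any(value != 0.0 for value in row):
            kept[word] = row
        else:
            dropped.append(word)
    return kept, dropped


# Toy data standing in for one decade (hypothetical values, 3 dimensions)
toy_vocab = ["the", "autism", "ship"]
toy_vectors = [[0.1, -0.2, 0.3], [0.0, 0.0, 0.0], [0.5, 0.0, -0.1]]

kept, dropped = split_zero_vectors(toy_vocab, toy_vectors)
print(sorted(kept))  # ['ship', 'the']
print(dropped)       # ['autism']
```

With the real data, vocab would be the unpickled vocabulary list for a decade and vectors the matching embedding matrix loaded with numpy.load. For a diachronic comparison you would keep the full matrix and use dropped only to flag which words have no usable embedding in that decade.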