Zero-valued vectors? - Githubissues

I was recently confused by this myself, and think I may have found an answer. Appendix A of the paper states (emphasis mine):

For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, [...]

My interpretation is that the vocabularies of all decades contain the same 100,000 words, and that a zero-valued vector indicates that no embedding was found for that word because it doesn't appear in the corpus of that decade.

That first assumption is quickly confirmed:

import pickle

vocabs = []
for decade in range(1800, 2000, 10):
    with open(f"./sgns/{decade}-vocab.pkl", 'rb') as f:
        # set semantics enable comparisons without having to sort lists manually
        vocabs.append(set(pickle.load(f)))

# union of first set with all others, serves as a point of reference in the comparison below
u = vocabs[0].union(*vocabs[1:])

# True only if every decade's vocabulary equals the union, i.e. all vocabularies are identical
all(u == v for v in vocabs)

Confirming the second assumption may be a little more involved, but my takeaway is that if you're viewing the data synchronically, it should be safe to drop the zero-valued vectors. If you need a diachronic view, as in the paper, you should not drop anything.
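For the synchronic case, dropping the zero-valued vectors could look roughly like the sketch below. It is not taken from the paper or the repository. The helper names `zero_vector_mask` and `drop_zero_vectors` are made up for illustration, and the sketch assumes that row i of the embedding matrix belongs to entry i of the vocabulary list. A small toy matrix stands in for the real data here.

```python
import numpy as np


def zero_vector_mask(embeddings):
    """Return a boolean array that is True for rows consisting only of zeros."""
    return ~np.any(embeddings, axis=1)


def drop_zero_vectors(vocab, embeddings):
    """Keep only the words whose vector has at least one non-zero entry.

    Assumes vocab[i] is the word for embeddings[i] (an assumption of this sketch).
    """
    keep = ~zero_vector_mask(embeddings)
    kept_words = [word for word, k in zip(vocab, keep) if k]
    return kept_words, embeddings[keep]


# Toy stand-in for one decade. With the real data, the vocabulary would come
# from ./sgns/{decade}-vocab.pkl and the matrix from a NumPy file next to it
# (the exact file name of the matrix is an assumption and not shown here).
vocab = ["apple", "internet", "horse"]
embeddings = np.array([[0.1, -0.3],
                       [0.0, 0.0],
                       [0.5, 0.2]])

words, vectors = drop_zero_vectors(vocab, embeddings)
print(words)
print(vectors.shape)
```

For a diachronic comparison, it is probably better to keep only the mask from `zero_vector_mask` and leave the matrices untouched, so that row indices stay aligned across decades.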

williamleif / histwords

Zero-valued vectors? #12