zjpbinary / CSCBLI

Code for the ACL2021 paper "Combining Static Word Embedding and Contextual Representations for Bilingual Lexicon Induction"

How are embedding sizes made equal? #3

Closed: CharithaRathnayake closed this issue 1 year ago

CharithaRathnayake commented 1 year ago

Is there a specific technique used to ensure that the sizes of the static source and target embeddings match those of the XLM-R source and target embeddings, or is it simply a matter of trimming?

zjpbinary commented 1 year ago

Thanks for your question. The steps we take to obtain static word embeddings and contextual representations are as follows:

  1. We use WikiExtractor to extract plain text from Wikipedia dumps, then train static word embeddings on these corpora with fastText.
  2. We restrict the dictionary induction process to the 20,000 most frequent words, corresponding to the top 20,000 fastText embeddings (you may want to have a look at this paper too: https://arxiv.org/abs/1805.06297).
  3. For each of these 20,000 words, we randomly sample k sentences containing the word from the Wikipedia corpora and extract its contextual representation as described in our paper. The static embeddings and contextual representations therefore cover exactly the same 20,000-word vocabulary (see the sketch after this list).
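
For concreteness, here is a rough sketch of steps 2 and 3 in Python with Hugging Face transformers. It is an illustration, not the repository's actual code: the model name `xlm-roberta-base`, the sample count `K`, the mean pooling over subwords and sentences, and the file-format assumptions are all mine. Step 1 would just be the standard fastText CLI, e.g. `fasttext skipgram -input wiki.txt -output wiki`.

```python
# Minimal sketch of steps 2-3 (NOT the repository's actual code).
# Assumptions: fastText vectors in .vec text format (frequency-sorted),
# a list of corpus sentences per word, K = 5 samples, mean pooling.
import random

import numpy as np
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizerFast

TOP_N = 20000  # restrict to the 20,000 most frequent words (step 2)
K = 5          # sentences sampled per word (assumed value)

def load_topn_fasttext(path, top_n=TOP_N):
    """.vec files list words by corpus frequency, so the first top_n
    rows are the static embeddings of the top_n most frequent words."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<vocab_size> <dim>" header line
        for i, line in enumerate(f):
            if i >= top_n:
                break
            tok, *nums = line.rstrip().split(" ")
            words.append(tok)
            vecs.append(np.asarray(nums, dtype=np.float32))
    return words, np.stack(vecs)

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def contextual_repr(word, sentences):
    """Step 3: average the XLM-R hidden states of `word`'s subword
    span over K randomly sampled sentences that contain the word."""
    reprs = []
    for sent in random.sample(sentences, min(K, len(sentences))):
        enc = tokenizer(sent, return_tensors="pt", truncation=True,
                        return_offsets_mapping=True)
        offsets = enc.pop("offset_mapping")[0].tolist()
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        # Locate the subword tokens whose character span overlaps `word`
        # (first occurrence only; a simplification for this sketch).
        start = sent.index(word)
        end = start + len(word)
        span = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        if span:
            reprs.append(hidden[span].mean(dim=0))
    return torch.stack(reprs).mean(dim=0) if reprs else None
```

Because both the static matrix and the contextual matrix are built from the same frequency-sorted 20,000-word list, the two tables line up row for row, with no trimming needed.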

Hope this helps!

CharithaRathnayake commented 1 year ago

Thank you for your reply.