ziyin-dl / word-embedding-dimensionality-selection

On the Dimensionality of Word Embedding
https://nips.cc/Conferences/2018/Schedule?showEvent=12567

Will this work on fasttext embeddings? #15

Open Priyansh2 opened 5 years ago

Priyansh2 commented 5 years ago

Beautiful work! I hope to see similar work for other types of embeddings, like contextual word embeddings. Will this work with fasttext? If not, which files do I have to edit? Also, can you shed some light on how to edit the files for other embeddings?

zi-yin commented 5 years ago

Thank you for your interest.

Will this work with fasttext?

If I remember correctly, fasttext is essentially skip-gram on top of character-level n-grams, so I believe it should apply out of the box. What needs to change is how tokenization works: given a piece of text, we should tokenize it into character n-grams before feeding it to the algorithm. So I think a new tokenizer should suffice (the current tokenizer is in utils/tokenizer.py).
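For illustration, here is a minimal sketch of such a character n-gram tokenizer. The function name, the 3-6 n-gram range, and the `<`/`>` word-boundary markers are assumptions borrowed from fastText's defaults, not part of this repo; the actual change would go into utils/tokenizer.py.

```python
# Sketch of a character n-gram tokenizer for fastText-style input.
# Assumptions: n-gram range 3-6 (fastText's defaults) and '<'/'>'
# word-boundary markers; the function name is hypothetical.

def char_ngram_tokenize(text, min_n=3, max_n=6):
    """Split text into words, then emit character n-grams for each word."""
    tokens = []
    for word in text.split():
        marked = "<" + word + ">"          # mark word boundaries, as fastText does
        for n in range(min_n, max_n + 1):
            for i in range(len(marked) - n + 1):
                tokens.append(marked[i:i + n])
        tokens.append(marked)              # fastText also keeps the whole word
    return tokens

# Example:
# char_ngram_tokenize("hello", min_n=3, max_n=4)
# -> ['<he', 'hel', 'ell', 'llo', 'lo>', '<hel', 'hell', 'ello', 'llo>', '<hello>']
```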

Feel free to submit a PR if you get it working on fasttext.

Also, can you shed some light on how to edit the files for other embeddings?

Any other embedding should be an algorithm that uses matrix factorization (implicitly or explicitly, e.g. Word2Vec, GloVe, or LSA). For such an algorithm, inherit from SignalMatrix (in matrix/signal_matrix) and implement the construct_matrix() function. Nothing else is needed.
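As a rough illustration of that pattern: the SignalMatrix base class and the construct_matrix() hook come from the comment above, but the import path, constructor interface, and the compute_cooccurrence() helper below are hypothetical placeholders, not the repo's actual API.

```python
# Rough sketch of adding a new embedding algorithm by subclassing
# SignalMatrix and implementing construct_matrix(), as described above.
# Import path, attribute names, and helpers are guesses for illustration.

import numpy as np
from matrix.signal_matrix import SignalMatrix  # assumed module path


class MyEmbeddingSignalMatrix(SignalMatrix):
    def construct_matrix(self):
        # Build the matrix that the embedding algorithm (implicitly or
        # explicitly) factorizes, e.g. a PMI matrix for skip-gram/word2vec
        # or a log co-occurrence matrix for GloVe.
        cooccurrence = self.compute_cooccurrence()  # hypothetical helper
        return np.log1p(cooccurrence)               # placeholder transform
```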

Priyansh2 commented 5 years ago

@zi-yin So for fasttext I need to change tokenizer.py as you said?

ziyin-dl commented 5 years ago

That's right. Changing the tokenizer to use character n-grams and running the algorithm with --algorithm word2vec should work.
