Priyansh2 opened 5 years ago

A beautiful piece of work. I hope to see similar work for other types of embeddings, like contextual word embeddings. Will this work with fastText? If not, which files do I have to edit? Also, can you shed some light on how to edit the files for other embeddings?
Thank you for your interest.
Will this work with fastText?
If I remember correctly, fastText is essentially skip-gram on top of character-level n-grams, so I believe it should apply out of the box. What needs to change is how tokenization works: given a piece of text, we should tokenize it into character n-grams before feeding it to the algorithm. So I think a new tokenizer should suffice (the current tokenizer is in utils/tokenizer.py).
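If it helps, here is a minimal sketch of what such a tokenizer could look like. The character n-gram scheme mirrors fastText's subword idea; the function name and signature below are illustrative, not the repo's actual interface in utils/tokenizer.py.

```python
def char_ngram_tokenize(text, n_min=3, n_max=6):
    """Split text into character n-grams per word, fastText-style.

    Hypothetical helper -- the real tokenizer lives in utils/tokenizer.py
    and its interface may differ.
    """
    ngrams = []
    for word in text.split():
        padded = "<" + word + ">"  # boundary markers, as fastText uses
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams

# Example: char_ngram_tokenize("where", n_min=3, n_max=3)
# -> ['<wh', 'whe', 'her', 'ere', 're>']
```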
Feel free to submit a PR if you get it working on fastText.
Also, can you shed some light on how to edit the files for other embeddings?
The other embedding should be an algorithm that uses matrix factorization, implicitly or explicitly (e.g. Word2Vec, GloVe, or LSA). For such an algorithm, inherit from SignalMatrix (in matrix/signal_matrix) and implement the construct_matrix() function. Nothing else is needed.
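As a rough sketch of what that subclass could look like (the attribute names self.vocab and self.corpus are assumptions about what SignalMatrix provides; the existing matrices in matrix/ show the real interface):

```python
import numpy as np
from matrix.signal_matrix import SignalMatrix  # path per the repo layout

class MyEmbeddingMatrix(SignalMatrix):
    """Signal matrix for a hypothetical new embedding algorithm."""

    def construct_matrix(self):
        # Build the matrix that the embedding (implicitly) factorizes,
        # e.g. a PMI or co-occurrence matrix over the vocabulary.
        # `self.vocab` and `self.corpus` are assumed attributes; check
        # the existing Word2Vec/GloVe matrices for the actual ones.
        n = len(self.vocab)
        matrix = np.zeros((n, n))
        # ... fill `matrix` from corpus co-occurrence statistics ...
        return matrix
```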
@zi-yin So for fastText I need to change tokenizer.py as you described?
That's right. Change the tokenizer to use character n-grams and run the algorithm with --algorithm word2vec; that should work.
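(For example, after swapping in the character n-gram tokenizer, an invocation along the lines of `python main.py --algorithm word2vec` should run the estimator over the new tokens; the entry-point name here is a guess, so check the repo's README for the exact command.)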