Thanks for the question. The framework is designed to jointly train word and paragraph embeddings, so you will be able to obtain both of them as outputs. Specifically, to train sentence embeddings, set the `-doc-output` argument to the directory where the sentence embeddings should be saved; after training, you can obtain the sentence embeddings from the location specified by `-doc-output`. For an example of setting the argument, you can refer to `eval_cluster.sh`.
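Once training finishes, reading the saved sentence embeddings back is straightforward. Here is a minimal sketch, assuming the output follows the word2vec-style text format (an optional "count dim" header line, then one "id v1 ... vd" line per vector); the file path below is only an example:

```python
import numpy as np

def load_embeddings(path):
    """Load word2vec-style text embeddings into a dict of id -> vector."""
    vecs = {}
    with open(path) as f:
        first = f.readline().split()
        if len(first) > 2:  # no "count dim" header: the first line is already a vector
            vecs[first[0]] = np.array(first[1:], dtype=np.float32)
        for line in f:
            parts = line.rstrip().split()
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

doc_vecs = load_embeddings("doc_emb.txt")  # example path; the actual location is set via -doc-output
```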
Regarding the paraphrase identification task, I haven't tested the framework against other baselines, so I can't comment on its performance on this task. That being said, this method is able to learn sentence embeddings and should serve as a baseline.
Please let me know if you have any further questions!
Best, Yu
Thanks for your answer. What if I want to get the embedding for a new sentence, i.e., one that is not in the training set? Should I average the embeddings of the words in the sentence?
For unsupervised sentence embeddings (like this method and Doc2Vec), there is usually no clear distinction between training and testing sets, because unsupervised methods do not require (and cannot make use of) labeled data: the class labels (e.g., in text classification tasks) are completely ignored, and only the raw texts are used to train the embeddings. You can therefore simply concatenate the texts of the testing set with the training set to construct a "new" set and train embeddings on it. This way, you get the sentence embeddings of the testing set out of the box.
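Concretely, building the combined corpus is just a file concatenation; a minimal sketch, where the file names are only examples:

```python
# Write training texts followed by testing texts into one corpus file.
# After training on all.txt, the doc vectors for the test sentences are
# simply those whose line indices come after the training portion.
with open("all.txt", "w") as out:
    for path in ("train.txt", "test.txt"):
        with open(path) as f:
            for line in f:
                out.write(line)
```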
An alternative is to average the word embeddings to construct sentence embeddings, as you mentioned. Although it is a competitive baseline, averaging has been shown to fall behind dedicated sentence embedding frameworks. My suggestion would therefore be to use averaged word embeddings only if you are looking for a simple sentence embedding baseline, or if you care more about computational efficiency (the averaging operation is obviously faster than training embeddings) than about absolute performance.
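For reference, the averaging baseline takes only a few lines once you have a word-to-vector lookup. A minimal sketch; the `word_vecs` dict, the whitespace tokenization, and the zero-vector fallback are my own assumptions, not part of this repo:

```python
import numpy as np

def avg_sentence_embedding(sentence, word_vecs, dim):
    """Average the vectors of in-vocabulary words in a sentence.

    word_vecs: dict mapping word -> np.ndarray of shape (dim,).
    """
    vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(vecs, axis=0)
```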
Feel free to let me know if anything is still unclear to you.
Best, Yu
Thank you very much; I understand what you mean. I want to do vector retrieval online, so the first method won't work for me, and I will try the second one.
Would it be possible to freeze the word embeddings and use (Riemannian?) SGD to learn the paragraph embedding for a new phrase online, similarly to how inference is done in Doc2Vec? To make it concrete, I'm imagining something like the sketch below.
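A generic sketch of a Riemannian SGD step on the unit sphere, with the word embeddings frozen and only the paragraph vector updated; `loss_grad` is a hypothetical callback, not this repo's API:

```python
import numpy as np

def riemannian_sgd_step(v, euclid_grad, lr):
    """One Riemannian SGD step on the unit sphere: project the Euclidean
    gradient onto the tangent space at v, step, then retract by normalizing."""
    tangent_grad = euclid_grad - np.dot(v, euclid_grad) * v  # remove radial component
    v_new = v - lr * tangent_grad
    return v_new / np.linalg.norm(v_new)  # retraction back onto the sphere

def infer_doc_vector(word_vecs, loss_grad, dim=100, lr=0.025, epochs=20):
    """Learn a single paragraph vector while the word embeddings stay frozen.

    loss_grad(v, word_vecs) should return the Euclidean gradient of the
    training objective with respect to the paragraph vector v.
    """
    rng = np.random.default_rng(0)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(epochs):
        v = riemannian_sgd_step(v, loss_grad(v, word_vecs), lr)
    return v
```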
Thanks for the comments. Unfortunately, I'm currently busy with other projects and cannot work on adding this functionality now. I might be able to work on it later, and I'll post an update here if I get it done.
For two semantically similar sentences, does this method produce sentence vectors that are closer to each other than other methods do? And how do I get the sentence embeddings?