yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

How to find similar embeddings?? #24

Closed iamsainianuj closed 3 years ago

iamsainianuj commented 4 years ago

I really love your work porting LASER to a Python pip package. I am new to this and trying to learn how to use these embeddings.

What I have done so far:

I have a list of sentences for each of three languages (say L_1, L_2 and L_3).

final = L_1 + L_2 + L_3

I generated the embeddings as shown below:

import laserembeddings as le

laser = le.Laser()

# One language code per sentence, in the same order as `final`.
# 'L_1', 'L_2' and 'L_3' are placeholders for the real language codes.
langs = ['L_1'] * len(L_1) + ['L_2'] * len(L_2) + ['L_3'] * len(L_3)

embeddings = laser.embed_sentences(final, lang=langs)

I am assuming that row i of embeddings corresponds to sentence i of final.

To find similar embeddings, I converted them into gensim KeyedVectors so that I can use functions such as similar_by_vector().

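from gensim.models import KeyedVectors

# Each sentence is used as the key for its own embedding.
Lang_based_keys = list(final)
sent_vecs = KeyedVectors(vector_size=embeddings.shape[1])
# Newer gensim releases (4.x) provide add_vectors() for this step.
sent_vecs.add(Lang_based_keys, embeddings)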

Here is my issue. Take a sentence that was present in final when the embeddings were generated: "The iPhone SDK, set programming tools developers, enhanced support development iPad".

When I look up the closest vectors in the embedding space to this sentence's vector, as follows:

word = "The iPhone SDK, set programming tools developers, enhanced support development iPad"

sent_vecs.similar_by_vector(sent_vecs[word], topn=100)

what I see is that the closest one returned by the model is:

[('Poborsky played minutes, 291 minutes, Czech Republic Euro 2004', 1.0), .... ..]

How is it possible that the similarity is 1.0, which would mean that both sentences have the same vector?

Kindly correct me wherever I am wrong.

Thank you.

yannvgn commented 3 years ago

Hi @iamsainianuj,

I think your issue is related to your use of KeyedVectors, not laserembeddings.
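As a rough sketch of how this could be checked without KeyedVectors: compare rows of the embedding matrix directly, and look for repeated sentences among the keys. The helper names below (`find_duplicate_keys`, `cosine`, `nearest_rows`) are made up for illustration, the toy data stands in for `final` and `embeddings`, and duplicate keys are only one possible cause assumed here, not a confirmed diagnosis. It is written in plain Python so it runs without extra dependencies.

```python
import math
from collections import Counter


def find_duplicate_keys(keys):
    """Return the keys that occur more than once, with their counts."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_rows(vectors, query_index, topn=5):
    """Rank all other rows by cosine similarity to row `query_index`."""
    query = vectors[query_index]
    scored = [
        (i, cosine(query, row))
        for i, row in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy stand-ins for `final` and `embeddings` (hypothetical data).
toy_sentences = ["a", "b", "a", "c"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

duplicates = find_duplicate_keys(toy_sentences)
neighbours = nearest_rows(toy_vectors, query_index=1, topn=2)
```

On the real data, the same functions could be called as `find_duplicate_keys(final)` and `nearest_rows(embeddings, final.index(word))`. If the direct comparison does not report a similarity of 1.0 for the unrelated sentence, the embeddings themselves are fine and the mismatch is introduced when the KeyedVectors are built.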

Here's what I would suggest to debug:

Hope this helps (even one year later 😅).

yannvgn commented 3 years ago

I'm closing the issue, feel free to re-open if needed.