yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

Lang attribute input check #35

Closed Darenar closed 2 years ago

Darenar commented 3 years ago

Hello! First of all, thank you very much for the repo - it is quite handy!

I just found one ambiguous behavior (at least it seems so to me) that may confuse other users.

If I have a list of 10 sentences, then laser_model.embed_sentences(list_of_sents, lang='en') returns a 10×1024 matrix. On the other hand, if I provide the language not as a string but as a list containing a single string, then laser_model.embed_sentences(list_of_sents, lang=['en']) returns a 1×1024 matrix. At first I thought this could be due to some aggregation, such as taking the mean of all 10 vectors. According to the code, though, it is clearly due to the zip function, which stops at the shorter of its inputs. I think it might be a good idea to either emit a warning or raise an error in such a case. It is just a suggestion, though!
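
For anyone who wants to see the truncation in isolation, here is a minimal sketch. The function pair_with_langs is a hypothetical stand-in for the pairing step, not the library's actual code; it only assumes that sentences and languages are combined with zip, as described above.

```python
def pair_with_langs(sentences, lang):
    """Hypothetical stand-in for the sentence/language pairing step."""
    # A plain string is taken to mean "same language for every sentence".
    if isinstance(lang, str):
        lang = [lang] * len(sentences)
    # zip() stops at the shortest input, so a 1-item list yields 1 pair.
    return list(zip(sentences, lang))


sentences = [f"sentence {i}" for i in range(10)]
rows_from_str = pair_with_langs(sentences, "en")
rows_from_list = pair_with_langs(sentences, ["en"])
print(len(rows_from_str), len(rows_from_list))  # prints: 10 1
```

Nine of the ten sentences are silently dropped in the second call, which would explain why only one embedding row comes back.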

yannvgn commented 2 years ago

Thanks for reporting the issue, @Darenar.

You're absolutely right, the behavior is misleading. I released a fix (v1.1.2); it now raises an error in such cases.
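
For illustration only, a check of this kind could look like the sketch below. It is not the code shipped in v1.1.2: the helper name normalize_langs, the choice of ValueError, and the message text are all assumptions.

```python
def normalize_langs(sentences, lang):
    """Hypothetical helper: return one language code per sentence, or raise."""
    if isinstance(lang, str):
        # One language for all sentences.
        return [lang] * len(sentences)
    langs = list(lang)
    if len(langs) != len(sentences):
        # Fail loudly instead of letting zip() drop sentences silently.
        raise ValueError(
            f"lang has {len(langs)} entries but there are "
            f"{len(sentences)} sentences"
        )
    return langs


try:
    normalize_langs(["a", "b", "c"], ["en"])
except ValueError as err:
    print(f"rejected: {err}")
```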

Cheers,