yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

Support for improved LASER2 embeddings #45

Open Thommy96 opened 1 year ago

Thommy96 commented 1 year ago

Hi,

In the current version of Facebook's LASER repository, they provide an improved LASER2 model trained on the same languages as the original LASER model. However, they also introduced a SentencePiece model (SPM) for tokenization. So I made a few changes to your code so that the improved model can be used easily from Python. To check that it works, I compared the generated embeddings with the original LASER2 embeddings (obtained using a fork of your test data repository). The resulting report shows an almost perfect match.
The tests for the new embeddings might still need to be adapted so that they can be run with poetry and pytest.
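
For anyone who wants to reproduce this kind of check, here is a rough, dependency-free sketch of comparing generated embeddings against reference ones. It is not the script that produced the report mentioned above: the function names `cosine_similarity` and `compare_embeddings`, the use of cosine similarity as the metric, and the 0.99 threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_embeddings(
    generated: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
    threshold: float = 0.99,  # assumed cutoff, not taken from the report
) -> Tuple[float, List[int]]:
    """Return the mean similarity and the indices of pairs below the threshold."""
    if len(generated) != len(reference):
        raise ValueError("both sets must contain the same number of sentences")
    if not generated:
        raise ValueError("nothing to compare")
    sims = [cosine_similarity(g, r) for g, r in zip(generated, reference)]
    mismatches = [i for i, s in enumerate(sims) if s < threshold]
    return sum(sims) / len(sims), mismatches
```

Each row is expected to be the embedding of one sentence, with both sets in the same sentence order; the returned indices point at the sentences whose embeddings diverge from the reference.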

Feel free to check out the changes :)