yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License
224 stars 29 forks

No error raised for an invalid/wrong tag, and the same results are obtained even when the tag is changed #34

Open VinuraD opened 3 years ago

VinuraD commented 3 years ago

Hi, I was testing different outputs with the 'laserembeddings' PyPI package. One thing I observed is that it doesn't raise an error for an invalid tag (such as 'xx' or 'yy', or even single letters like 'x' or 'y'). Also, when I tried Sinhala (tag='si'), I observed that I get an output embedding even if I change the tag (to a valid or an invalid one), and these outputs are identical every time. How can this behavior be explained? How can we verify the results? Or could there be something wrong with my setup?

Python==3.7.10, torch==1.8.1+cu101

import numpy as np

from laserembeddings import Laser

laser = Laser()

embeddings = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='si')

embeddings2 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='y')

embeddings3 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='en')

embeddings4 = laser.embed_sentences(
    ["A test sentence"],
    lang='si')  # even though the tag doesn't match the text, a result is returned

comp = embeddings2 == embeddings
comp2 = embeddings2 == embeddings3
print(np.sum(comp))
print(np.sum(comp2))
print(comp.all())
print(embeddings4)

Result:

1024
1024
True
[[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]

yannvgn commented 3 years ago

Hi @VinuraD,

The sentences are first tokenized before being embedded. The tokenization step relies on Moses in Facebook's original LASER implementation. For portability reasons, I decided to use its Python port, Sacremoses, for laserembeddings (Moses is implemented in Perl, Sacremoses is pure Python).

To make the tokenization accurate, Moses uses language-specific lists of non-breaking prefixes (see: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes). If the list is not defined for a language (which is often the case), it defaults to English. Moses displays a warning in that case; Sacremoses doesn't.
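The fallback described above can be sketched in a few lines of plain Python. This is an illustrative model of the lookup, not Sacremoses' actual API, and the prefix lists are abridged stand-ins:

```python
# Hypothetical sketch of a Moses-style non-breaking-prefix lookup with an
# English fallback; names and lists are illustrative, not Sacremoses' API.
NONBREAKING_PREFIXES = {
    "en": {"Mr", "Dr", "vs"},    # abridged example list
    "de": {"Hr", "Dr", "bzw"},   # abridged example list
    # ... many more languages, but no entry for 'si', 'xx', 'yy', 'y', ...
}

def get_prefixes(lang):
    """Return the prefix list for `lang`, silently falling back to English
    when no list exists (Moses would print a warning; Sacremoses stays silent)."""
    return NONBREAKING_PREFIXES.get(lang, NONBREAKING_PREFIXES["en"])

# 'si' has no list, so it resolves to the English rules -- as do 'xx', 'y', etc.
print(get_prefixes("si") == get_prefixes("en"))  # True
print(get_prefixes("de") == get_prefixes("en"))  # False
```

Since any unknown tag resolves to the same English rules, 'si', 'xx' and 'y' all produce identical tokenization, and therefore identical embeddings.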

The language identifier is only used during this tokenization step; every further step is completely language-independent.

Sinhala (like 'xx', 'yy', 'x', 'y', etc.) has no non-breaking prefix list, so the tokenization rules for English are used. This explains why you get the same results in your example. Since LASER was trained on Sinhala (https://github.com/facebookresearch/LASER/#supported-languages), there shouldn't be an issue (and for this reason I think a warning would be misleading in that case).
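If you nevertheless want to catch obviously bogus tags on the caller side, a minimal guard is easy to add. This is a hypothetical sketch: laserembeddings itself performs no such check, and `SUPPORTED` below is an abridged, illustrative stand-in for LASER's actual supported-language list:

```python
# Hypothetical caller-side guard -- laserembeddings yields an embedding for
# any tag, so this check has to live in user code.
import warnings

SUPPORTED = {"en", "si", "fr", "de", "km", "fy"}  # abridged for illustration

def check_lang(lang):
    """Warn (rather than fail) on an unrecognized language tag, since the
    tag only affects tokenization and English rules are used as a fallback."""
    if lang not in SUPPORTED:
        warnings.warn(
            f"Unknown language tag {lang!r}; English tokenization rules "
            "will be applied."
        )
    return lang

check_lang("si")  # known tag: silent
check_lang("y")   # unknown tag: emits a UserWarning
```

A warning (rather than an exception) matches the point above: an unlisted tag is not necessarily an error, it just means the English tokenization rules apply.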

Now, Sacremoses sometimes gives slightly different output compared to Moses, leading to potential embedding differences between LASER and laserembeddings (see https://github.com/yannvgn/laserembeddings#will-i-get-the-exact-same-embeddings). Unfortunately, Sinhala is not included in the test set I use to check the consistency between LASER and laserembeddings. If you want to be sure, you could do some testing by comparing the embeddings produced by Facebook's original implementation with those produced by laserembeddings.
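One way to quantify such a comparison is cosine similarity between the two embedding vectors for the same sentence: a value very close to 1.0 means the implementations agree in direction even if individual coordinates differ slightly. A dependency-free sketch (you would feed it the 1024-dimensional vectors from each implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors score ~1.0 (up to float rounding); orthogonal ones score 0.
print(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```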

I hope this helps!

VinuraD commented 3 years ago

@yannvgn Thanks for the explanation, which clarifies a lot. Just to know: can the dissimilarities for tags such as 'km' or 'fy' in your comparison (https://github.com/yannvgn/laserembeddings/blob/master/tests/report/comparison-with-LASER.md) be justified? Does that mean those languages' embeddings aren't valid when using this particular version of laserembeddings? Is there a specific way you adjust for or correct these differences, maybe in future versions?