yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

Different embeddings with different lengths #17

Closed vchulski closed 4 years ago

vchulski commented 4 years ago

I faced an issue: when I encode the same sentence in lists of different lengths, I get slightly different embeddings.

Here is some code that shows what I mean:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')
(a[0]==b[0]).all() # check if all elements are the same
#False
(a[0]==c[0]).all()
#True
np.linalg.norm(a[0]-b[0])
#1.3968409e-07
np.linalg.norm(a[0]-c[0])
#0.0

My goal is to get the same embedding for the sentence "apple", no matter the size of the text list I pass in - but that doesn't seem possible with the current version of laserembeddings. I would like to know whether this behavior is intentional or a bug.

yannvgn commented 4 years ago

Hi Vadim,

Thank you for your feedback.

I suspect that the difference you're noticing is due to the fact that input sentences are processed in batches at inference time (default LASER behavior).

In your example, when computing a, "apple", "banana" and "clementina" are processed together in the same batch. When computing b, "apple" is processed alone.

Internally, every sequence in a batch must have the same length, so padding is used to make each sequence as long as the longest sequence in the batch.

What is more, input sentences are first BPE-encoded. In your example:

When processing ["apple", "banana", "clementina"] in a single batch, "apple" is internally represented with the following sequence: <PAD>, ap@@, ple, <END>.

When processing ["apple"] alone, the sequence ap@@, ple, <END> is used.

However I wouldn't really worry about that because the difference is negligible: np.allclose(a[0], b[0]) would return True.

If you really need zero difference, you can set the maximum batch size to 1:

laser = Laser(embedding_options={'max_sentences': 1})

See https://github.com/yannvgn/laserembeddings/blob/v1.0.0/laserembeddings/embedding.py#L26.
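
For example, a quick check using the option above (with max_sentences=1 each sentence is encoded in its own batch, so "apple" never gets extra padding):

import numpy as np
from laserembeddings import Laser

laser_1 = Laser(embedding_options={'max_sentences': 1})

a = laser_1.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser_1.embed_sentences(["apple"], lang='en')

(a[0] == b[0]).all()
# True
np.linalg.norm(a[0] - b[0])
# 0.0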

I hope this answers your question.

Cheers

vchulski commented 4 years ago

Hi @yannvgn,

Thanks for your fast and informative response. Now I understand the source of the problem in my application of laserembeddings.

However I wouldn't really worry about that because the difference is negligible: np.allclose(a[0], b[0]) would return True.

This is not really True. I've checked it:

from laserembeddings import Laser
import numpy as np

laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')
np.allclose(a[0], b[0])  #False

I have one more question regarding laser = Laser(embedding_options={'max_sentences': 1})

I noticed that it works even faster than laser = Laser() in my test: https://gist.github.com/vchulski/8a89108ea4431ae7984bbc44a4a2627e

In my env it takes:

encoding a,b,c takes: 0.996 seconds
a[0] and b[0] are almost same: False
a[0] and c[0] are almost same: True
L2 distance between a[0] and b[0]: 1.3968409007247828e-07
L2 distance between a[0] and c[0]: 0.0
L2 distance between a[1] and c[1]: 0.3170284330844879
---------------------------------
encoding (max_sentences=1) a,b,c takes: 0.685 seconds
a[0] and b[0] are almost same: True
a[0] and c[0] are almost same: True
L2 distance between a[0] and b[0]: 0.0
L2 distance between a[0] and c[0]: 0.0
L2 distance between a[1] and c[1]: 0.3170284330844879

I did several runs and the second part always takes less time than the first one. My question is how that is possible, given that the usual setup encodes all sentences in one batch while the second one encodes each sentence separately.

Thanks in advance! Your response really helped me.

yannvgn commented 4 years ago

This is not really True. I've checked it.

Aaaah right ;) Thanks for the correction. But still, the difference is really small.

I did several runs and the second part always takes less time than the first one. My question is how that is possible, given that the usual setup encodes all sentences in one batch while the second one encodes each sentence separately.

The timing difference you're seeing here is not due to batching, but to the initialization of the tokenization library, which is done only once. Just run the test in the reverse order (laser_1 first) and you'll see that it's always the first run that takes longer.

If you want to measure the timing difference, you'll have to use separate processes (or you'll have to make sure that everything is initialized, by running embed_sentences once, before the measurements).

Cheers

vchulski commented 4 years ago

@yannvgn I split this into two separate scripts and made several runs.

It still looks like laser = Laser(embedding_options={'max_sentences': 1}) is faster than laser = Laser()

which seems a little strange, given that it processes each sentence separately instead of processing a batch.

Anyway, huge thanks for the workaround.

yannvgn commented 4 years ago

Ok, but your batch is relatively small (3 elements). If you try with more sentences, you should see that the batched version is faster:

import timeit
from laserembeddings import Laser

laser = Laser()
laser_1 = Laser(embedding_options={'max_sentences': 1})

# 300 sentences
sentences = ['apple', 'potato', 'strawberry'] * 100

def init():
    laser.embed_sentences(['init'], lang='en')

def test_with_batch():
    laser.embed_sentences(sentences, lang='en')

def test_without_batch():
    laser_1.embed_sentences(sentences, lang='en')

# make sure everything is initialized
init()

print(timeit.timeit(test_with_batch, number=10))
# 4.908349000004819

print(timeit.timeit(test_without_batch, number=10))
# 66.42792240000563

vchulski commented 4 years ago

Ok, but your batch is relatively small (3 elements).

That's a valid point. Thanks for the example you provided and all of your answers.