yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

Sentence length and embeddings #26

Closed aalloul closed 3 years ago

aalloul commented 4 years ago

Hello and thank you for this library!

I have a question regarding how different sentence lengths are treated. Here's the code I ran:

from laserembeddings import Laser as ls
from laserembeddings.preprocessing import Tokenizer

tokenizer = Tokenizer(lang)
sent_to_embed = tokenizer.tokenize(my_big_sentence)  # my_big_sentence is actually a whole document
# len(sent_to_embed.split(" ")) == 14037
laser = ls()
laser_extended = ls(embedding_options={"max_tokens": 40000, "max_sentences": 400})
t = sent_to_embed.split(" ")
out = []
for split_at in range(100, len(t), 5000):
    print(split_at)
    out.append({
        "split_at": split_at,
        "default_embedding": laser.embed_sentences(" ".join(t[:split_at]), "nl"),
        "extended_embedding": laser_extended.embed_sentences(" ".join(t[:split_at]), "nl"),
    })

out.append({
    "split_at": len(t),
    "default_embedding": laser.embed_sentences(" ".join(t), "nl"),
    "extended_embedding": laser_extended.embed_sentences(" ".join(t), "nl"),
})

Then I computed the cosine similarity between the last embedding (i.e. out[-1]) and each of the other ones; the result is in the plot below.

As you can see, the results from the two LASER instances (laser and laser_extended) are indistinguishable. Is this expected? I also get exactly the same result with max_tokens = 200. I would have expected the result to stop changing once the number of tokens exceeds this parameter.
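For reference, the similarity measure above is plain cosine similarity. A minimal sketch of how it can be computed (assuming, as in laserembeddings, that embed_sentences returns a NumPy array with one 1024-dimensional row per sentence):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors.

    Flattens the inputs so single-row (1, d) arrays, as returned for a
    single sentence, work directly.
    """
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With the snippet above, the plotted values would be cosine_similarity(out[-1]["default_embedding"], o["default_embedding"]) for each earlier entry o.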

(plot: cosine similarity to out[-1] against split point, for both LASER instances)

yannvgn commented 3 years ago

Hi @aalloul,

max_tokens and max_sentences control batching; they are not used to truncate the input. You can adjust these parameters to trade off computing performance against memory usage.
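To illustrate the distinction (this is a sketch, not the library's actual implementation): a token-budget batcher groups whole sentences so that each forward pass stays under the caps, but no individual sentence is dropped or cut, so the resulting embeddings are the same regardless of the batch limits.

```python
def make_batches(sentences, lengths, max_tokens=12000, max_sentences=None):
    """Group sentences into batches bounded by a total-token budget and an
    optional per-batch sentence count. Sentences are never truncated."""
    batch, batch_tokens = [], 0
    for sent, n_tok in zip(sentences, lengths):
        over_tokens = batch and batch_tokens + n_tok > max_tokens
        over_count = max_sentences is not None and len(batch) >= max_sentences
        if over_tokens or over_count:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sent)  # the whole sentence always goes in, intact
        batch_tokens += n_tok
    if batch:
        yield batch
```

Under this scheme, raising max_tokens or max_sentences only changes how many sentences are encoded per pass (memory vs. throughput), which is consistent with the identical similarity curves in the plot above.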

If you're wondering if there's a length limit for a sentence, please refer to: https://github.com/facebookresearch/LASER/issues/137#issuecomment-606764408.

yannvgn commented 3 years ago

I'm closing the issue, please feel free to re-open if needed.