xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

not getting exactly the same embedding for different batchsize #76

Open kirnap opened 1 year ago

kirnap commented 1 year ago

Hi,

I recently discovered that the model.encode method does not return exactly the same embedding for different batch_size values. The results are still close when I loosen atol (absolute tolerance). Is this expected behaviour or a bug?

Here is a minimal code snippet that reproduces the differing embeddings:


import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')

query_instruction = 'Represent the Movie query for retrieving similar movies or tv shows: '
s1 = 'word'

# Four identical [instruction, text] pairs, so every row should get the same embedding.
batch = [[query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1]]

# Encode the same inputs with three different batch sizes.
bbig2 = model.encode(batch, batch_size=2)
bbig4 = model.encode(batch, batch_size=4)
bbig1 = model.encode(batch, batch_size=1)

# Compare batch_size=4 against batch_size=1 at two absolute tolerances.
if not np.allclose(bbig4, bbig1, atol=1e-8):
    print('Different batchsize is not close for 1e-8 absolute tolerance')
if np.allclose(bbig4, bbig1, atol=1e-7):
    print('Different batchsize is close enough for 1e-7 absolute tolerance')

This prints out the following results:

Different batchsize is not close for 1e-8 absolute tolerance
Different batchsize is close enough for 1e-7 absolute tolerance

thanks in advance!
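
A note on the comparison in the snippet: np.allclose tests |a - b| <= atol + rtol * |b|, and rtol defaults to 1e-5, so the check is not a pure absolute-tolerance test. Below is a minimal sketch for measuring the largest absolute gap directly. It uses small synthetic arrays in place of real embeddings, and the helper name max_abs_diff and the sample values are hypothetical, not part of InstructorEmbedding.

```python
import numpy as np

def max_abs_diff(a, b):
    """Return the largest element-wise absolute difference between two arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

# Synthetic stand-ins for two embeddings of the same text (not real model output).
emb_a = np.array([0.0123, -0.0456, 0.0789], dtype=np.float32)
emb_b = emb_a + np.float32(5e-8)

# Largest absolute gap between the two vectors.
print(max_abs_diff(emb_a, emb_b))

# rtol=0 turns np.allclose into a pure absolute-tolerance check.
print(np.allclose(emb_a, emb_b, rtol=0, atol=1e-6))
```

With real data, the same helper could be applied to bbig1 and bbig4 to see the actual size of the gap rather than a pass/fail at a chosen tolerance.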

aditya-y47 commented 1 year ago

Any more findings on this yet?

kirnap commented 1 year ago

Not from my end

dkirman-re commented 1 year ago

Most likely this has something to do with the underlying HF transformers package. There's a lot of finger pointing, but unfortunately still no resolution at this point. Relevant GitHub issues: https://github.com/UKPLab/sentence-transformers/issues/2312 https://github.com/huggingface/transformers/issues/2401
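
One commonly cited cause for this kind of discrepancy, which nobody in this thread has confirmed for INSTRUCTOR specifically, is that floating-point addition is not associative: if a different batch size leads the backend to sum the same numbers in a different order, the last bits of the result can change. A minimal sketch of the effect in float32, with values chosen purely for illustration:

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Same three numbers, two groupings.
left = (a + b) + c   # a and b cancel first, then c is added: 1.0
right = a + (b + c)  # c is lost to rounding when added to b: 0.0

print(left, right)
print(left == right)
```

The differences reported in this issue are far smaller than in this contrived case, but the mechanism would be the same if summation order is indeed what changes.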

eyalyoli commented 10 months ago

I'm having the same issue. I tried manipulating other things, like the order or content of the batch, but the only factor that affects this is the batch size.

ayalaall commented 10 months ago

Same here. I'm getting a different embedding for each batch_size. The embeddings start to differ at about the seventh decimal place.
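
For context, a difference at about the seventh decimal place is on the order of single-precision rounding error, assuming the embeddings are float32 (the dtype is not stated in this thread). A quick sketch for checking the relevant limits with NumPy; the gap value and the factor of 4 are arbitrary illustrative choices:

```python
import numpy as np

info = np.finfo(np.float32)
print(info.eps)        # gap between 1.0 and the next representable float32
print(info.precision)  # approximate number of reliable decimal digits

# Hypothetical check: does a gap of 1e-7 fall within a few float32 eps?
gap = 1e-7
print(gap < 4 * info.eps)
```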