xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Peculiar Cosine Similarity Values #51

Closed afoland closed 11 months ago

afoland commented 1 year ago

Is there a reason the model seems only to output embeddings with cosine similarities all in a very narrow range?

(One way this can happen is if effectively only a very small subspace of the 768 dimensions is getting used)

I have tried a number of different tasks, with many different strings and types of strings, and find that the results are nearly always in a narrow range from about +0.4 to +0.9. This is despite creating test sets that should generate lots of orthogonal embeddings and graded similarities from near to far. I have literally been unable to get a value under +0.4 for any two embeddings; I have only ever gotten above +0.9 when comparing a vector with itself (as a sanity check).
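If only a small subspace of the 768 dimensions were carrying the variance, the singular value spectrum of a batch of embeddings would show it. A quick diagnostic sketch, using the participation ratio of the spectrum (the function name is mine, and synthetic data stands in for real embeddings):

```python
import numpy as np

def effective_dim(embeddings):
    """Participation ratio of the singular value spectrum of the
    mean-centered embedding matrix: close to n when variance is spread
    over n directions, close to 1 when it collapses onto one."""
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / (s**2).sum()
    return 1.0 / (p**2).sum()
```

Feeding `model.encode(...)` outputs through this and getting a number far below 768 would support the small-subspace hypothesis.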

I find this true for both base and XL models.

I find this even for the example given in the README. I created the following test code:

import sys

arguments = sys.argv
original = (arguments[1].lower() == "true")
q1 = (arguments[2].lower() == "true")
crosscheck = not original  # '~original' is bitwise NOT, which is truthy for both booleans

#Below copy-pasted from https://github.com/HKUNLP/instructor-embedding except 
#  for if statements failing "original"

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-base')

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

if (original or q1):
    query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
else:
    query  = [['Represent the Wikipedia question for retrieving supporting documents: ','what is the dominant economic theory in the United States?']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
if crosscheck and not original:
    corpus.append(query[0] + [])  # append a copy of the query as a self-similarity check
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
if not original:
    print(similarities)

I then ran it three times: exactly as posted on the README, adding a printout of similarities (with the query vector with itself added as a test), and then using a different query created to score very highly with the first corpus item and printing out the results:

$ python3 instructor_example_2.py True True
load INSTRUCTOR_Transformer
max_seq_length  512
3
$ python3 instructor_example_2.py False True
load INSTRUCTOR_Transformer
max_seq_length  512
3
[[0.7325637  0.71300924 0.7206404  1.        ]]
$ python3 instructor_example_2.py False False
load INSTRUCTOR_Transformer
max_seq_length  512
3
[[0.86386305 0.83299637 0.8046411  0.9999999 ]]

In high-dimensional spaces a cosine similarity of 0.7 is very significant; however, a question about a yam has nothing visibly to do with capitalism or disparate impact.

Two other embedding models both returned much more intuitive results (nearly all 0 for the first yam query; graduated similarities for the capitalism query) that were both pretty close to one another.

I've found the same thing with other queries and corpora; the similarity outputs are always in a very narrow range.
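One common diagnostic for this "everything is similar" pattern is anisotropy: all embeddings share a large mean component, which inflates every pairwise cosine. Subtracting the corpus mean before computing cosine similarity restores contrast. A sketch of that correction (my own helper, not something the InstructorEmbedding API provides):

```python
import numpy as np

def centered_cosine(query_emb, corpus_embs):
    """Cosine similarity after removing the corpus mean direction,
    a standard correction for anisotropic embedding spaces."""
    mu = corpus_embs.mean(axis=0, keepdims=True)
    q = query_emb - mu
    C = corpus_embs - mu
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    return q @ C.T
```

If the raw cosines sit in a tight band but the centered ones spread out (including going negative), the narrow range is an artifact of a shared offset rather than genuine semantic similarity.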

hongjin-su commented 1 year ago

Hi, thanks a lot for the comments!

Since the INSTRUCTOR model is trained to distinguish the relative similarity between two sentences, the cosine similarity values may need normalization before they can be interpreted intuitively. As your example shows, the cosine similarity is lower when the query is not relevant, which is the expected behavior for the purpose of distinction.
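Concretely, one way to normalize per query is to min-max rescale the scores across the candidates, so only the relative ordering and spacing matter. A minimal sketch (the helper name is mine, not part of the library):

```python
import numpy as np

def normalize_scores(sims):
    """Min-max rescale one query's cosine similarities so the least
    similar candidate maps to 0 and the most similar maps to 1."""
    sims = np.asarray(sims, dtype=float)
    lo, hi = sims.min(), sims.max()
    if hi == lo:
        return np.zeros_like(sims)  # all candidates tied: no ranking signal
    return (sims - lo) / (hi - lo)
```

Applied to a row like `[0.733, 0.713, 0.721]`, this stretches the narrow band back to the full [0, 1] range for comparison within that query; it does not make scores comparable across different queries.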

Feel free to add any further questions or comments!

hongjin-su commented 11 months ago

Please re-open the issue if you have any questions or comments!

freckletonj commented 11 months ago

In high dimensional spaces cosine similarity of 0.7 is very significant

I agree, it's possibly indicative of a bug that all similarities live in a tight band, and there are never negative similarities.