Closed: afoland closed this issue 11 months ago
Hi, Thanks a lot for the comments!
As the INSTRUCTOR model is trained to distinguish the relative similarity between two sentences, the cosine similarity values may need normalization before they can be interpreted intuitively. As your example shows, when the query is not relevant the cosine similarity is lower, which is the expected behavior for the purpose of distinction.
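The thread does not say which normalization is meant; a minimal sketch, assuming a simple min-max rescaling of the raw scores (one possible choice, not the authors' prescribed method):

```python
import numpy as np

def rescale_similarities(sims):
    """Min-max rescale raw cosine similarities onto [0, 1].

    Assumption: min-max rescaling is just one simple way to spread
    out scores that sit in a tight band; the thread does not specify
    the normalization the maintainers had in mind.
    """
    sims = np.asarray(sims, dtype=float)
    lo, hi = sims.min(), sims.max()
    if hi == lo:  # all scores identical; nothing to spread out
        return np.zeros_like(sims)
    return (sims - lo) / (hi - lo)

# Typical raw range reported later in this thread (+0.4 to +0.9)
raw = [0.42, 0.55, 0.71, 0.88]
print(rescale_similarities(raw))  # scores now span the full [0, 1] range
```

Rescaling like this only restores contrast for ranking within one batch of scores; it does not change the relative ordering the model produces.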
Feel free to add any further questions or comments!
Please re-open the issue if you have any questions or comments!
In high-dimensional spaces, a cosine similarity of 0.7 is very significant.
I agree; the fact that all similarities live in a tight band, and that negative similarities never occur, is possibly indicative of a bug.
Is there a reason the model seems to output embeddings whose cosine similarities all fall in a very narrow range?
(One way this can happen is if only a small subspace of the 768 dimensions is effectively being used.)
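The small-subspace hypothesis can be illustrated numerically: if every embedding shares one large common direction plus small independent noise, the vectors occupy a narrow cone and all pairwise cosine similarities land in a tight, all-positive band, much like the values reported below. A self-contained sketch with synthetic vectors (not INSTRUCTOR outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 768, 50

# One dominant shared direction, unit length
common = rng.standard_normal(dim)
common /= np.linalg.norm(common)

# Small independent noise per vector (norm ~0.5 vs. the shared norm of 1)
noise = 0.5 * rng.standard_normal((n, dim)) / np.sqrt(dim)

# Every embedding = shared direction + its own small perturbation
embs = common + noise
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

# Pairwise cosine similarities of unit vectors = dot products
sims = embs @ embs.T
off_diag = sims[~np.eye(n, dtype=bool)]
print(off_diag.min(), off_diag.max())  # a compressed, all-positive band
```

With truly independent 768-dimensional vectors, off-diagonal similarities would instead cluster near 0, which is the behavior the narrow band fails to show.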
I have tried a number of different tasks, with many different strings and types of strings, and find that the results almost always fall in a very narrow range, from about +0.4 to +0.9. This is despite creating test sets that should produce plenty of orthogonal embeddings as well as graded similarities from near to far. I have been unable to get a value under +0.4 for any two embeddings, and have only ever gotten above +0.9 when comparing a vector with itself (as a sanity check).
I find this to be true for both the base and XL models.
I find this even for the example given in the README. I created the following test code:
I then ran it three times: first exactly as posted in the README; then with a printout of the similarities added (including the query vector against itself as a check); and finally with a different query written to score very highly against the first corpus item, again printing the results:
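The test code itself was not preserved in the thread. As a minimal sketch of the similarity-printing pattern described, using random stand-in vectors in place of outputs from INSTRUCTOR's `model.encode` (independent random high-dimensional vectors give near-zero cosine similarity, which is the intuitive baseline being expected of unrelated texts):

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (norm(a) * norm(b)))

# Stand-in vectors; the actual test encoded the README's query and
# corpus strings with INSTRUCTOR, which is not reproduced here.
rng = np.random.default_rng(1)
query = rng.standard_normal(768)
corpus = [rng.standard_normal(768) for _ in range(3)]

for i, doc in enumerate(corpus):
    print(i, cosine_similarity(query, doc))  # near 0 for independent vectors

# Sanity check: a vector against itself is exactly 1.0
print("self", cosine_similarity(query, query))
```

For genuinely unrelated 768-dimensional random vectors, these similarities sit within a few hundredths of zero, in sharp contrast to the +0.4 floor observed with the model's embeddings.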
In high-dimensional spaces, a cosine similarity of 0.7 is very significant; however, a question about a yam has nothing visibly to do with capitalism or disparate impact.
Two other embedding models both returned much more intuitive results (similarities near 0 for the yam query; graded similarities for the capitalism query), and the two were in close agreement with each other.
I've found the same thing with other queries and corpora; the similarities always fall in a very narrow range.