weaviate / weaviate

Weaviate is an open-source vector database that stores both objects and vectors, combining vector search with structured filtering and the fault tolerance and scalability of a cloud-native database.
https://weaviate.io/developers/weaviate/
BSD 3-Clause "New" or "Revised" License

Poor search accuracy on multi2vec-clip text search with text fields #3535

Open LazerCube opened 1 year ago

LazerCube commented 1 year ago

Summary

When using text fields in a multi2vec-clip class, I've observed a large decrease in search accuracy. Specifically, objects whose descriptions exactly match the query aren't always ranked at the top and are sometimes omitted entirely. For example, an image of an airplane with the description "An airplane flying through a blue sky" isn't returned when searching for "airplane". However, if the description is empty, search works as expected.
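For reference, this is roughly the kind of class definition involved (a minimal sketch; the class and property names here are illustrative, not taken from my demo repository):

```python
# Hypothetical multi2vec-clip class definition matching the scenario above.
# "Image", "image", and "description" are placeholder names; the moduleConfig
# layout follows the multi2vec-clip module's schema options.
image_class = {
    "class": "Image",
    "vectorizer": "multi2vec-clip",
    "moduleConfig": {
        "multi2vec-clip": {
            "imageFields": ["image"],       # blob property vectorised by the image model
            "textFields": ["description"],  # text property vectorised by the text model
        }
    },
    "properties": [
        {"name": "image", "dataType": ["blob"]},
        {"name": "description", "dataType": ["text"]},
    ],
}
```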

I seem to get the same issue using any of the text search methods, so I'm not really sure what's going on.

I might just be doing something incorrectly, so I've created a repository demonstrating the issue in more detail here.

tsmith023 commented 1 year ago

There's a PR https://github.com/weaviate/weaviate/pull/3560 that fixes a bug with the multi2vec-clip vectoriser, which may or may not influence this issue. As part of the CI, preview images are published to Docker Hub.

If you're willing, could you test with the semitechnologies/weaviate:preview-ensure-name-is-passed-correctly-in-function-647aadc image to see if your issue is resolved by the bugfix? Cheers!

LazerCube commented 1 year ago

I've just tested with the semitechnologies/weaviate:preview-ensure-name-is-passed-correctly-in-function-647aadc image, but unfortunately the issue persists.

tsmith023 commented 1 year ago

To narrow down the potential problems, would you be able to experiment with different weights for the imageFields and textFields in the schema definition?

Specifically, weight it first to 100% image / 0% text, then to 0% image / 100% text, and observe the results. This will help identify whether the problem lies with the models themselves or with the heuristic that the vectorisation process applies on top of them. Also, to be clear, are you running NearText searches?
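Something like the following would do it (a sketch; the field names are placeholders, but the `weights` layout follows the multi2vec-clip module options):

```python
def clip_config(image_weight: float, text_weight: float) -> dict:
    """Build a multi2vec-clip moduleConfig with explicit field weights.

    Running the same import and query under each extreme isolates whether the
    image model or the text model is driving the results.
    """
    return {
        "multi2vec-clip": {
            "imageFields": ["image"],
            "textFields": ["description"],
            "weights": {
                # One weight per field, in the same order as the field lists.
                "imageFields": [image_weight],
                "textFields": [text_weight],
            },
        }
    }

all_image = clip_config(1.0, 0.0)  # 100% image, 0% text
all_text = clip_config(0.0, 1.0)   # 0% image, 100% text
```

If the results are identical under both configurations, that would suggest the weights aren't being applied at all.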

LazerCube commented 1 year ago

Yes, I am using NearText searches. I tested again by adjusting the field weights: first to 100% image, then to 100% text. Regardless of these adjustments, there was no variation in the search results. I deleted and recreated the schema before re-importing my test data each time, yet even the certainty and distance values in the results remained identical across the two configurations.

However, I also tried using multi2vec-bind and that does seem to work as expected. So I might just transition to that instead.

tsmith023 commented 1 year ago

Thanks for experimenting with your configuration and providing the detailed feedback. To me, this signifies that the query provided in NearText is not being vectorised into the same vector space as the objects. As mentioned, there are heuristics involved in multi2vec-clip, so this could very well be the source of the unexpected behaviour. I will therefore leave this issue open, since I believe it is a bug.

As to using multi2vec-bind, do be aware of its potentially restrictive non-open-source license!