zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License

GIST Ground Truth Data Missing #292

Open wahajali opened 8 months ago

wahajali commented 8 months ago

I want to run the Search Performance Test on the GIST dataset. I created a new test because the current workloads don't include GIST in the performance tests; GIST and SIFT are currently used only in the capacity test.

However, the dataset doesn't include the ground truth data. The tool downloads only train.parquet and not the ground truth file (I believe that would be neighbors.parquet).

alwayslove2013 commented 8 months ago

Right. We are considering opening up more datasets in the next release, as well as supporting users with their own local datasets.

xinhuitian commented 4 months ago

@alwayslove2013 I also need this! Is there any update? Or is there a way to generate neighbors.parquet from the original GIST and SIFT ground truth files? Thanks.
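For reference, the original GIST and SIFT ground truth files are distributed in the .ivecs format, where each record is a little-endian int32 length followed by that many int32 neighbor ids. A minimal sketch of parsing that layout is below. The function name `read_ivecs` and the sample bytes are made up for illustration, and how the parsed rows would map onto the columns of neighbors.parquet is not confirmed anywhere in this thread.

```python
import struct


def read_ivecs(data: bytes) -> list[list[int]]:
    """Parse .ivecs bytes into one list of neighbor ids per query."""
    rows = []
    offset = 0
    while offset < len(data):
        # Each record starts with a little-endian int32 giving its length.
        (dim,) = struct.unpack_from("<i", data, offset)
        offset += 4
        # The next `dim` little-endian int32 values are the neighbor ids.
        rows.append(list(struct.unpack_from(f"<{dim}i", data, offset)))
        offset += 4 * dim
    return rows


# Synthetic example (not real benchmark data): two queries, three neighbors each.
sample = struct.pack("<4i", 3, 10, 11, 12) + struct.pack("<4i", 3, 7, 8, 9)
neighbors = read_ivecs(sample)
print(neighbors)
```

For a real file, the bytes would come from `open(path, "rb").read()`. The ids are only usable as ground truth if the vectors loaded into the database keep the same order and ids as the original base set, which is an assumption to verify against train.parquet.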

wahajali commented 1 month ago

@alwayslove2013 I wanted to ask how we can generate the ground truth data. I am using pgvector, and my understanding is that if I remove the index and query the data, the results should be exact and can serve as the ground truth. To verify this, I tested it on the OpenAI 500K dataset (cosine distance) and found a few mismatches between the ground truth I calculated and the one provided by VectorDBBench. The difference is only in the order; the set of returned vectors is the same. Usually two elements are just swapped.
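The check described above can be sketched outside the database as an exhaustive scan followed by a set-versus-order comparison. This is an illustration only, not VectorDBBench's or pgvector's code: the helper names, the toy vectors, and the `reference` list are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Return the ids of the k nearest vectors by scanning every vector."""
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # equal distances fall back to the smaller id
    return [i for _, i in scored[:k]]


def compare(computed, reference):
    """Report whether two result lists agree as sets and as ordered lists."""
    return {
        "same_set": set(computed) == set(reference),
        "same_order": computed == reference,
    }


# Toy data: ids are list positions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

computed = exact_top_k(query, vectors, k=2)
reference = [1, 0]  # hypothetical reference list with two ids swapped
result = compare(computed, reference)
print(computed, result)
```

Reporting the set comparison separately from the order comparison makes it easy to see when a mismatch is only a reordering, as in the case described here.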

greenhal commented 1 month ago

This happens when there are ties in the ground truth. There is no guarantee that any specific engine will return ties in a specific order, or even in the same order consistently.
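One way to make a comparison robust to ties is to validate a result against the exact distances rather than against a fixed ordering. The sketch below assumes the exact distance from the query to every vector is available; `is_valid_top_k` and the tolerance `eps` are hypothetical names chosen for illustration.

```python
def is_valid_top_k(returned_ids, distances, k, eps=1e-6):
    """Accept any top-k result that differs from another only in how ties are resolved.

    distances[i] is the exact distance from the query to vector i.
    """
    if len(returned_ids) != k or len(set(returned_ids)) != k:
        return False
    kth_distance = sorted(distances)[k - 1]
    returned = set(returned_ids)
    # Every vector strictly closer than the k-th distance must be present.
    required = {i for i, d in enumerate(distances) if d < kth_distance - eps}
    if not required <= returned:
        return False
    # Every returned vector must be no farther than the k-th distance.
    return all(distances[i] <= kth_distance + eps for i in returned)


# Vectors 1 and 2 are tied for second place.
distances = [0.10, 0.20, 0.20, 0.35]
print(is_valid_top_k([0, 1], distances, k=2))
print(is_valid_top_k([0, 2], distances, k=2))
print(is_valid_top_k([0, 3], distances, k=2))
```

With this check, either tied vector is accepted in the second slot, while a result that includes a strictly farther vector or omits a strictly closer one is rejected.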