Open wahajali opened 8 months ago
Right. We are considering opening up more datasets in the next release, as well as supporting users with their own local datasets.
Currently GIST and SIFT are only used in the capacity test. The dataset doesn't contain the ground truth data. I believe that would be neighbors.parquet.
@alwayslove2013 I also need this! Any update here? Or is there any way to generate neighbors.parquet from the original GIST and SIFT ground truth files? Thanks.
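For anyone attempting that conversion, here is a minimal sketch of the first half, assuming the original ground truth files are in the usual .ivecs layout (each record is a little-endian int32 count followed by that many int32 neighbor IDs). The helper names `read_ivecs` and `ivecs_to_rows` are made up for illustration and are not part of VectorDBBench. It also assumes queries are numbered from 0 in file order and that the neighbor IDs match the IDs in train.parquet, which should be checked before relying on the result.

```python
import struct
from typing import BinaryIO, Iterator, List, Optional, Tuple


def read_ivecs(stream: BinaryIO) -> Iterator[List[int]]:
    """Yield one list of int32 values per record of an .ivecs stream.

    Assumed record layout: int32 count d, then d int32 values,
    all little-endian.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) != 4:
            raise ValueError("truncated record header")
        (dim,) = struct.unpack("<i", header)
        if dim <= 0:
            raise ValueError(f"invalid record length: {dim}")
        payload = stream.read(4 * dim)
        if len(payload) != 4 * dim:
            raise ValueError("truncated record body")
        yield list(struct.unpack(f"<{dim}i", payload))


def ivecs_to_rows(
    stream: BinaryIO, k: Optional[int] = None
) -> List[Tuple[int, List[int]]]:
    """Build (query_id, neighbor_ids) rows, optionally keeping only the top k.

    Query IDs are assigned from 0 in file order (an assumption).
    """
    rows = []
    for query_id, ids in enumerate(read_ivecs(stream)):
        rows.append((query_id, ids if k is None else ids[:k]))
    return rows
```

The rows can then be written out with any parquet writer. The column names VectorDBBench expects in neighbors.parquet are not stated in this thread, so it is worth inspecting an existing neighbors.parquet from another dataset and matching its schema.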
@alwayslove2013 I wanted to ask how we can generate ground truth data. I am using pgvector, and my understanding is that if I remove the index and query the data, I should get the GT data. To verify this, I tested it on the OpenAI 500K dataset (cosine distance) and found a few mismatches between the GT data I calculated and the one provided by VectorDBBench. The difference is only in the order; the set of returned vectors is the same. Usually two elements are just swapped.
This happens when there are ties in the ground truth. There is no guarantee that any specific engine will return ties in a specific order, or even in the same order consistently.
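To make the tie behaviour concrete, here is a small pure-Python sketch of exact (brute-force) top-k under cosine distance, plus a comparison that ignores order. All function names are hypothetical and none of this is VectorDBBench or pgvector code. It assumes non-zero vectors and breaks ties by the smaller ID, which is only one possible convention. For real datasets you would want a vectorised implementation, and near-ties can still swap when distances are computed in a different precision or summation order.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], base: Dict[int, Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Score every base vector and return the k closest as (distance, id).

    Sorting the tuples orders by distance first and by ID second,
    so exact ties are resolved deterministically.
    """
    scored = [(cosine_distance(query, vec), vec_id) for vec_id, vec in base.items()]
    scored.sort()
    return scored[:k]


def same_neighbors_ignoring_order(
    expected_ids: Sequence[int], actual_ids: Sequence[int]
) -> bool:
    """True when both results contain the same IDs, in any order."""
    return len(expected_ids) == len(actual_ids) and set(expected_ids) == set(actual_ids)


def recall_at_k(expected_ids: Sequence[int], actual_ids: Sequence[int]) -> float:
    """Fraction of expected IDs that appear in the actual result."""
    return len(set(expected_ids) & set(actual_ids)) / len(expected_ids)
```

A set-based recall like this treats two results that differ only by swapped tied elements as equal, which matches the observation that the returned set is the same and only the order differs.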
I want to run the Search Performance Test on the GIST dataset. I created a new test, since the current workloads don't include GIST in the performance test. Currently GIST and SIFT are only used in the capacity test.
However, the dataset doesn't contain the ground truth data. It only downloads train.parquet and doesn't download the ground truth data (I believe that would be neighbors.parquet).