zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License

Regarding the dataset #253

Open mithunputhusseri opened 6 months ago

mithunputhusseri commented 6 months ago

Is the S3 bucket containing the datasets publicly accessible? If so, can you share the bucket information? One more question: what are these filtering cases?

alwayslove2013 commented 6 months ago

@mithunputhusseri All the databases we have tested so far support scalar filtering. To test their vector search performance in scalar filtering scenarios, we prepare corresponding scalar data when inserting the vector data.

In the train.parquet file, we assign an auto-incrementing id to each vector. (shuffle_train.parquet is generated simply by shuffling the rows of train.parquet; the correspondence between id and vector remains the same.)
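To illustrate the point about shuffling, here is a minimal sketch using plain Python lists as a stand-in for the parquet files. The toy rows and the helper name `id_to_vector` are assumptions made for illustration, not the actual VectorDBBench code or data layout.

```python
import random

def id_to_vector(rows):
    """Build an id -> vector mapping from a list of (id, vector) rows."""
    return {row_id: vector for row_id, vector in rows}

# Toy stand-in for train.parquet: each row is (id, vector), ids auto-increment.
train_rows = [(i, [float(i), float(i) * 2.0]) for i in range(5)]

# Toy stand-in for shuffle_train.parquet: the same rows in a random order.
random.seed(0)
shuffled_rows = train_rows[:]
random.shuffle(shuffled_rows)

# Only the row order changes; every id still points to the same vector.
same_mapping = id_to_vector(shuffled_rows) == id_to_vector(train_rows)
```

Because whole rows are moved, a filter on id selects the same vectors whichever file is loaded.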

When performing the query test, taking the Cohere 1M filter 99% case as an example, we set the filter condition to "id >= 1_000_000 * 0.99", so only about 1% of the vectors (those with the highest ids) pass the filter. Correspondingly, we regenerate the ground truth file neighbors_99p.parquet, computed over the filtered vectors only, for recall computation.
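As a rough sketch of how a filtered ground truth and the resulting recall could be computed, the example below brute-forces the nearest neighbors among the rows that satisfy the id condition. The function names, the toy data size, and the squared L2 distance are assumptions for illustration; the actual benchmark may use a different metric and implementation.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance (metric chosen here as an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filtered_ground_truth(ids, vectors, query, k, min_id):
    """Exact top-k neighbor ids among rows that satisfy id >= min_id."""
    candidates = [
        (squared_l2(vector, query), row_id)
        for row_id, vector in zip(ids, vectors)
        if row_id >= min_id
    ]
    candidates.sort()
    return [row_id for _, row_id in candidates[:k]]

def recall_at_k(result_ids, truth_ids):
    """Fraction of the ground-truth ids that appear in the search result."""
    return len(set(result_ids) & set(truth_ids)) / len(truth_ids)

# Toy dataset: 1,000 vectors instead of 1M, ids starting at 0.
random.seed(0)
n, dim, k = 1000, 8, 10
ids = list(range(n))
vectors = [[random.random() for _ in range(dim)] for _ in ids]
query = [random.random() for _ in range(dim)]

# "Filter 99%": keep only ids >= n * 0.99 (integer arithmetic avoids float rounding).
min_id = n * 99 // 100
truth = filtered_ground_truth(ids, vectors, query, k, min_id)

# A result that returns exactly the ground truth has recall 1.0.
perfect_recall = recall_at_k(truth, truth)
```

The key point is that the ground truth must be recomputed per filter rate: neighbors taken from the unfiltered ground truth would mostly fail the id condition and make recall meaningless.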

We think that 1% and 99% represent low and high filtering scenarios well. In the future, we will support more filter rates and more scalar filtering scenarios, such as categorical scalar fields and compound filter conditions combining AND / OR. If you have any suggestions, feel free to share them with us!