Regarding the dataset

@mithunputhusseri All the databases we have tested so far support scalar filtering. To test their vector search performance in scalar filtering scenarios, we prepare corresponding scalar data when inserting the vector data.

In the train.parquet file, we prepare an auto-incrementing id for each vector. (shuffle_train.parquet is generated simply by shuffling the rows of train.parquet; the correspondence between id and vector remains the same.)
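To illustrate why shuffling does not affect the id filter, here is a minimal sketch in NumPy. It is not the script used to build shuffle_train.parquet; the helper name `shuffle_rows` and the toy arrays are assumptions made for illustration. The point is only that applying one permutation to both columns keeps every id paired with its original vector.

```python
import numpy as np

def shuffle_rows(ids, vectors, seed=0):
    """Reorder rows with a single permutation so each id keeps its vector."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(ids))
    return ids[perm], vectors[perm]

# Toy stand-in for train.parquet: 5 vectors of dimension 2, ids 0..4 (assumed start).
ids = np.arange(5)
vectors = np.arange(10, dtype=np.float32).reshape(5, 2)

shuffled_ids, shuffled_vectors = shuffle_rows(ids, vectors)

# Looking up each shuffled id in the original table returns the same vector.
assert np.array_equal(shuffled_vectors, vectors[shuffled_ids])
```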

For the query test, taking Cohere 1M with a 99% filter as an example, we set the filter condition to "id >= 1_000_000 * 0.99", i.e. id >= 990_000, so roughly 1% of the vectors pass the filter. Correspondingly, we regenerate the ground truth file neighbors_99p.parquet, computed over the filtered vectors only, for recall computation.
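As a rough sketch of what this means in practice, the snippet below derives the id threshold from a filter rate, computes brute-force ground truth over only the vectors that pass the filter, and scores a result list against it. This is not the VectorDBBench implementation: the function names, the toy data, and the use of squared L2 distance are all assumptions made for illustration, and the real datasets may use a different metric.

```python
import numpy as np

def min_passing_id(num_vectors: int, filter_percent: int) -> int:
    """Integer form of the condition `id >= num_vectors * filter_percent / 100`."""
    return num_vectors * filter_percent // 100

def filtered_ground_truth(ids, vectors, queries, min_id, k):
    """Exact top-k neighbor ids among vectors whose id passes the filter."""
    keep = ids >= min_id
    kept_ids, kept_vectors = ids[keep], vectors[keep]
    # Squared L2 distance between every query and every kept vector (assumed metric).
    diffs = queries[:, None, :] - kept_vectors[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return kept_ids[nearest]

def recall_at_k(retrieved, truth):
    """Mean fraction of ground-truth ids that appear in the retrieved ids."""
    k = truth.shape[1]
    hits = [len(set(r) & set(t)) for r, t in zip(retrieved.tolist(), truth.tolist())]
    return float(np.mean(hits)) / k

# Toy data: 100 vectors on a line, ids 0..99, one query at the origin.
ids = np.arange(100)
vectors = np.stack([ids.astype(np.float32), np.zeros(100, dtype=np.float32)], axis=1)
queries = np.array([[0.0, 0.0]], dtype=np.float32)

min_id = min_passing_id(len(ids), 90)  # 90% filter on 100 vectors
truth = filtered_ground_truth(ids, vectors, queries, min_id, k=3)
```

Without the filter, the nearest neighbors of this query would be ids 0, 1 and 2; with it, only ids at or above the threshold are eligible. That is why a separate ground truth file is needed for each filter rate.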

We think that 1% and 99% represent low and high filtering scenarios well. In the future, we will support more filter rates and more scalar filtering scenarios, such as categorical scalar fields and complex filter conditions that combine AND / OR. If you have any suggestions, feel free to share them with us!

zilliztech / VectorDBBench

Regarding the dataset #253