Open mithunputhusseri opened 6 months ago
@mithunputhusseri All the databases we tested so far support scalar filtering. To test their vector search performance in scalar filtering scenarios, we will prepare corresponding scalar data when inserting vector data.
In the `train.parquet` file, we prepare a self-incremented id for each vector. (`shuffle_train.parquet` is just generated by shuffling `train.parquet`; the correspondence between id and vector remains the same.)
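To make the id/vector pairing concrete, here is a minimal sketch in pandas. The frame, column names (`id`, `emb`), and toy sizes are illustrative assumptions, not the actual dataset schema; in practice the frames would be written out as `train.parquet` and `shuffle_train.parquet`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.parquet: each row holds a self-incremented id
# and its vector (4-dim here; real embedding dims are much larger).
n, dim = 10, 4
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "id": np.arange(n),
    "emb": list(rng.random((n, dim)).astype(np.float32)),
})

# shuffle_train.parquet holds the same rows in random order; only the
# row order changes, the id <-> vector pairing is untouched.
shuffled = train.sample(frac=1.0, random_state=42).reset_index(drop=True)

# The pairing survives shuffling: looking up an id in either frame
# yields the same vector.
row = shuffled[shuffled["id"] == 3].iloc[0]
assert np.array_equal(row["emb"], train.loc[3, "emb"])
```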
When performing the query test, using the Cohere 1M filter 99% case as an example, we will set the filter condition to `id >= 1_000_000 * 0.99`. Correspondingly, we will regenerate the ground truth file `neighbors_99p.parquet` for recall computation.
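Regenerating a filtered ground truth amounts to a brute-force nearest-neighbor search restricted to rows that pass the filter. A minimal sketch of that idea, using a toy corpus of 1,000 vectors instead of 1M (names and sizes are assumptions for illustration only):

```python
import numpy as np

# Toy corpus: 1_000 base vectors with self-incremented ids. The 99%
# filter keeps only ids >= n * 0.99, mirroring id >= 1_000_000 * 0.99.
n, dim = 1_000, 8
keep_from = int(n * 0.99)  # filter condition: id >= 990
rng = np.random.default_rng(1)
base = rng.random((n, dim)).astype(np.float32)
query = rng.random(dim).astype(np.float32)

# Brute-force ground truth restricted to rows passing the filter,
# analogous to regenerating neighbors_99p.parquet.
ids = np.arange(n)
mask = ids >= keep_from
dists = np.linalg.norm(base[mask] - query, axis=1)
top10 = ids[mask][np.argsort(dists)[:10]]

# Every ground-truth neighbor must satisfy the filter condition.
assert (top10 >= keep_from).all()
```

Recall is then computed against this filtered neighbor list rather than the unfiltered one, since a correct database result can only contain ids passing the filter.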
We think that 1% and 99% represent the low- and high-filtering scenarios well. In the future, we will support more filter rates and more scalar filtering scenarios, such as adding categorical scalars and complex filter conditions combining `and` / `or`. If you have any suggestions, feel free to share them with us!
Is the S3 bucket containing the datasets exposed publicly? If so, can you share the bucket information? And one more doubt: what are these filtering