zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License

Increase Size of Test Dataset During SEARCH_CONCURRENT Stage #385

Open wahajali opened 1 month ago

wahajali commented 1 month ago

I would like to propose that instead of using the 1,000 test query set that VectorDBBench currently uses during the SEARCH_SERIAL stage to calculate recall, we should use a larger pool of queries during the SEARCH_CONCURRENT phase where QPS is calculated. Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

To provide context for this proposal, I'd like to share some results from tests I ran comparing the original 1,000 test queries with 10,000 queries randomly selected from the original training dataset, on pgvector with HNSW. The tests were run with an index larger than the available memory cache (shared_buffers in PostgreSQL), meaning the index could not fit in memory. For HNSW, the performance impact of this is expected to be significant. In the first case, I used the original 1,000 queries; in the second, I used 10,000 randomly selected queries from the original dataset. For reference, I've also included results for the OpenAI 500K dataset, whose index does fit into memory.

| Dataset | QPS | Test Data Size |
|---|---|---|
| 5M OpenAI | 1030.69815 | 1,000 (original) |
| 5M OpenAI | 4.4392 | 10,000 (generated) |
| 500K OpenAI | 1276.48165 | 1,000 (original) |
| 500K OpenAI | 1143.8795 | 10,000 (generated) |

As you can see, the decrease in QPS is dramatic. The low QPS is expected because the index is significantly larger than the available buffers, which forces disk I/O. With only 1,000 test queries, this effect isn't apparent at all, perhaps because the limited number of queries doesn't force the entire index to be loaded into memory. While a recent improvement in randomly selecting the query index has made the QPS more realistic, I believe this change would make the numbers even more reflective of actual performance.

In my opinion, there are two options:

1. Generate a larger test dataset using the same methodology as previously used.
2. Randomly select vectors from the training dataset and use them as the test queries.
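Option 2 could be as simple as sampling (without replacement) from the train vectors already on disk. A minimal NumPy sketch, assuming the train set is available as an in-memory array; the function name and toy data below are illustrative, not part of VectorDBBench's actual API:

```python
import numpy as np

def sample_concurrent_queries(train_vectors, n_queries=10_000, seed=0):
    """Sample test queries from the train set for the concurrent stage.

    No ground-truth file is needed, since recall isn't computed during
    SEARCH_CONCURRENT; only QPS is measured.
    """
    rng = np.random.default_rng(seed)
    n_queries = min(n_queries, len(train_vectors))
    # Sample distinct row indices so no query is repeated.
    idx = rng.choice(len(train_vectors), size=n_queries, replace=False)
    return train_vectors[idx]

# Toy example: 5,000 train vectors of dimension 128
train = np.random.default_rng(1).random((5_000, 128), dtype=np.float32)
queries = sample_concurrent_queries(train, n_queries=2_000)
```

Sampling with a fixed seed keeps runs reproducible, while still spreading queries across the whole index so the buffer cache can't serve everything.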

alwayslove2013 commented 4 weeks ago

@wahajali Thank you for your incredibly detailed observation! You’re absolutely right—during the conc_tests, recall isn’t calculated, so there’s no need for a ground truth file.

> Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

I love the idea of extending the functionality of conc_tests to allow for an increased number of test queries. We could sample from train_vectors or even generate them randomly, which would provide a more comprehensive evaluation of different vector DB memory strategies. Your insights are invaluable!

> 1. Randomly select vectors from the training dataset and use them as the test queries.