Open · wahajali opened this issue 1 month ago
@wahajali Thank you for your incredibly detailed observation! You're absolutely right: during the conc_tests stage, recall isn't calculated, so there's no need for a ground truth file, and the concurrent queries can be an entirely different set from the ones used to compute recall.
I love the idea of extending the functionality of conc_tests to allow for a larger number of test queries. We could sample from train_vectors or even generate queries randomly, which would provide a more comprehensive evaluation of different vector DB memory strategies. Your insights are invaluable!
I would like to propose that, instead of the 1,000-query test set that VectorDBBench currently uses during the SEARCH_SERIAL stage to calculate recall, we use a larger pool of queries during the SEARCH_CONCURRENT stage, where QPS is measured. Since recall isn't computed during this stage, we also don't need ground truth (GT) for these queries, so they can be an entirely different set.
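To make the distinction concrete, here is a minimal sketch of the two measurements: recall needs ground-truth neighbor IDs for every query, while QPS only needs query vectors and a working search call. `measure_qps`, `recall_at_k`, and `search_one` are illustrative placeholders, not the actual VectorDBBench APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_qps(queries, search_one, concurrency: int = 16, k: int = 100) -> float:
    """QPS only needs query vectors: run them across workers and count throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # search_one(vector, k) stands in for the client's search call.
        list(pool.map(lambda q: search_one(q, k), queries))
    return len(queries) / (time.perf_counter() - start)

def recall_at_k(results, ground_truth, k: int = 100) -> float:
    """Recall, by contrast, cannot be computed without ground-truth neighbor IDs."""
    hits = sum(len(set(r[:k]) & set(gt[:k])) for r, gt in zip(results, ground_truth))
    return hits / (k * len(ground_truth))
```

Because only the QPS-style measurement runs during the concurrent stage, its query pool can be arbitrarily large and completely disjoint from the GT-backed serial set.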
To provide context for this proposal, I'd like to share results from tests on pgvector with HNSW comparing the original 1,000 test queries against 10,000 queries randomly selected from the training dataset. The tests were run with an index larger than the available memory cache (shared_buffers in PostgreSQL), so the index could not fit in memory; for HNSW, the performance impact of this is expected to be significant. For reference, I've also included results for the OpenAI 500K dataset, which does fit in memory.
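For anyone reproducing this setup, one way to confirm that the HNSW index really exceeds the cache is to compare its on-disk size with shared_buffers. A sketch using psycopg2; the connection string and the index name (`items_embedding_hnsw_idx`) are placeholders:

```python
import psycopg2

# Placeholder DSN and index name; adjust to your environment.
with psycopg2.connect("dbname=vectordb user=postgres") as conn, conn.cursor() as cur:
    # On-disk size of the pgvector HNSW index.
    cur.execute("SELECT pg_size_pretty(pg_relation_size('items_embedding_hnsw_idx'))")
    index_size = cur.fetchone()[0]

    # Memory PostgreSQL can use to cache pages.
    cur.execute("SHOW shared_buffers")
    shared_buffers = cur.fetchone()[0]

print(f"HNSW index size: {index_size}, shared_buffers: {shared_buffers}")
```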
As you can see, the drop in QPS is dramatic. Low QPS is expected here because the index is significantly larger than the available buffers, which forces disk IO. With only 1,000 test queries this effect isn't apparent at all, perhaps because such a limited set of queries doesn't force the entire index to be read into memory. A recent improvement that randomizes query selection has already made the QPS more realistic, but I believe this change would make the numbers even more reflective of actual performance.
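One way to check that explanation is to watch the index's buffer cache hit ratio during each run: with 1,000 queries the hot portion of the index stays cached, while a 10,000-query random pool should drive idx_blks_read up. A sketch against pg_statio_user_indexes (the index name is again a placeholder):

```python
import psycopg2

def index_cache_hit_ratio(dsn: str, index_name: str) -> float:
    """Fraction of index page requests served from shared_buffers rather than disk."""
    query = (
        "SELECT idx_blks_hit, idx_blks_read "
        "FROM pg_statio_user_indexes WHERE indexrelname = %s"
    )
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (index_name,))
        hit, read = cur.fetchone()
        total = hit + read
        return hit / total if total else 1.0

# e.g. index_cache_hit_ratio("dbname=vectordb", "items_embedding_hnsw_idx")
```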
In my opinion, there are two options:
1. Generate a larger test dataset using the same methodology as before.
2. Randomly select vectors from the training dataset and use them as the test queries (a rough sketch of this follows below).
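A minimal sketch of option 2, plus the random generation suggested in the reply above, assuming the training vectors are available as a NumPy array; `sample_concurrent_queries` and its parameters are illustrative names, not an existing VectorDBBench API:

```python
import numpy as np

def sample_concurrent_queries(train_vectors: np.ndarray,
                              n_queries: int = 10_000,
                              generate_random: bool = False,
                              seed: int = 42) -> np.ndarray:
    """Build a query pool for the concurrent (QPS-only) stage.

    No ground truth is needed because recall isn't computed here, so the
    pool can be much larger than, and disjoint from, the serial test set.
    """
    rng = np.random.default_rng(seed)
    if generate_random:
        # Random vectors within the data's per-dimension range.
        lo, hi = train_vectors.min(axis=0), train_vectors.max(axis=0)
        return rng.uniform(lo, hi, size=(n_queries, train_vectors.shape[1])).astype(train_vectors.dtype)
    # Option 2: sample real vectors from the training set without replacement.
    idx = rng.choice(train_vectors.shape[0], size=n_queries, replace=False)
    return train_vectors[idx]
```

Sampling real vectors keeps the query distribution matched to the data, while random generation stresses parts of the index that the original 1,000 queries may never touch.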