zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License
458 stars 108 forks

Embedding Generation Times Being Included In Final Reported Time #263

Closed: gpudb-nnegahban closed this issue 2 months ago

gpudb-nnegahban commented 5 months ago

Is there something we should be doing to ensure that the benchmark suite does not include the embedding generation time in the final number for 'load duration'?

Shouldn't that time strictly cover the insert function that preps the record and inserts it into the database?

For example, our overall load duration for the 1M, 768-dim dataset is 100s, and 60s of that is the embedding value generation. E.g., if we comment out the body of the 'insert_embeddings' function hook in our driver, the load test still takes 60s. The actual loading into the database takes only 40s, i.e., the time spent in our driver's 'insert_embeddings' function.

Are we missing something in our driver that would prevent the timing from being measured like this?

alwayslove2013 commented 5 months ago

@gpudb-nnegahban Currently, all the datasets used in VDBBench consist of pre-computed vectors, so there is no embedding generation step.

If you are looking to incorporate new datasets and test cases with embedding generation, please provide us with more detailed information.

Concerning the "load_duration" metric, we aim to display the time from the start of data insertion until the database is ready for queries, which encompasses the time for inserting vectors, creating an index, building the index, and optimization.

Given that several cloud products automatically initiate index building during data insertion, it is challenging to separate these time periods. Additionally, the optimization strategy varies for each product, and displaying it on its own could mislead users. After careful consideration, we have decided to present the total time from the start of data insertion until the database is ready to serve queries.

eglaser77 commented 5 months ago

@alwayslove2013 - just a follow-up clarification...does the data insertion time include the client-side file reading time in addition to the database insertion time?

alwayslove2013 commented 5 months ago

@eglaser77 Yes, load_duration includes the time it takes to read the data file.

In theory, it is possible to read the entire data file before inserting the data and subtract the reading time from the overall duration. However, this places a heavy demand on the memory capacity of the testing client, especially for datasets of 10M vectors at 768 dimensions or larger.
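To put a rough number on the memory concern, here is a back-of-the-envelope calculation. It assumes the vectors are stored as 4-byte float32 values, which this thread does not state, and it counts only the raw values (no IDs, no Python or Arrow container overhead). The helper name `raw_vector_bytes` is made up for illustration.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vector values only.

    bytes_per_value=4 assumes float32; that is an assumption, not something
    confirmed in this thread.
    """
    return num_vectors * dim * bytes_per_value


size = raw_vector_bytes(10_000_000, 768)
print(f"{size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")  # 30.72 GB (28.61 GiB)
```

Under that assumption, holding a 10M x 768 dataset fully in client memory needs roughly 30 GB before any overhead.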

Currently, we use the pyarrow iterator to read and insert data batch by batch in order to conserve memory. Simply subtracting the iterator's read time would be inaccurate because, during that time, the database is still actively processing the data that has already been inserted.
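For anyone who wants to see how much of their own run is client-side reading versus time inside the driver, a minimal sketch of a streaming load loop with two accumulators follows. This is not VDBBench's actual runner: `timed_load`, `fake_reader`, and the `insert` callable are hypothetical stand-ins, and the sleep only simulates disk reads.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

Batch = List[List[float]]


def timed_load(
    batches: Iterable[Batch], insert: Callable[[Batch], None]
) -> Tuple[float, float, float]:
    """Return (read_seconds, insert_seconds, total_seconds) for a streaming load."""
    read_s = 0.0
    insert_s = 0.0
    start = time.perf_counter()
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # client-side file reading/decoding happens here
        except StopIteration:
            break
        t1 = time.perf_counter()
        insert(batch)  # hypothetical driver hook, e.g. an insert_embeddings wrapper
        t2 = time.perf_counter()
        read_s += t1 - t0
        insert_s += t2 - t1
    total_s = time.perf_counter() - start
    return read_s, insert_s, total_s


def fake_reader(n_batches: int, batch_size: int, dim: int) -> Iterator[Batch]:
    """Stand-in for a file iterator; the sleep simulates disk read latency."""
    for _ in range(n_batches):
        time.sleep(0.01)
        yield [[0.0] * dim for _ in range(batch_size)]


inserted: List[Batch] = []
read_s, insert_s, total_s = timed_load(fake_reader(5, 4, 8), inserted.append)
print(f"read={read_s:.3f}s insert={insert_s:.3f}s total={total_s:.3f}s")
```

The split is diagnostic only. As noted above, the database may still be working on earlier batches while the client reads the next one, so `total - read` is not a clean "database-only" number.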

Maybe we can try to reduce this time by pre-fetching the next batch from the data file while the current one is being inserted, or by using an SSD?
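One way the pre-fetching idea could look is a small wrapper that reads ahead in a background thread through a bounded queue, so memory stays limited to a few batches. This is only a sketch under assumptions: `prefetch` and `depth` are illustrative names and not part of VDBBench, and the benefit depends on the underlying reader releasing the GIL during I/O.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def prefetch(source: Iterable[T], depth: int = 2) -> Iterator[T]:
    """Yield items from `source`, reading up to `depth` items ahead in a thread."""
    buf: "queue.Queue" = queue.Queue(maxsize=depth)  # bounded: caps memory use

    def worker() -> None:
        try:
            for item in source:
                buf.put((False, item))  # blocks while the buffer is full
            buf.put((True, None))  # normal end-of-stream marker
        except Exception as exc:
            buf.put((True, exc))  # hand reader errors to the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        done, payload = buf.get()
        if done:
            if payload is not None:
                raise payload
            return
        yield payload


# Usage: wrap any batch iterator; ordering is preserved.
for batch in prefetch(iter([[1, 2], [3, 4], [5, 6]]), depth=2):
    print(batch)
```

If the consumer stops early, the worker stays blocked on a full queue until the process exits (it is a daemon thread), which is acceptable for a benchmark client but would need a stop signal in long-lived code.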

alwayslove2013 commented 5 months ago

@eglaser77 I guess what you need is to test the database directly with the data already inserted. Feel free to update to the latest version. We provide an index_already_existed option to skip the insertion and optimization.

pr: #260