Is there any inbuilt support to insert data into milvus parallelly?

zilliztech / VectorDBBench

A Benchmark Tool for VectorDB

MIT License


Is there any inbuilt support to insert data into milvus parallelly? #257

Closed agsachin closed 5 months ago

agsachin commented 8 months ago

When we run tests, we are mostly focused on search performance, but inserting the data takes a lot of time. Can we have a parallel insert capability? It would save us a lot of time during insertion. We are testing the performance of a Milvus cluster with 5M records.

alwayslove2013 commented 8 months ago

@agsachin Milvus supports parallel insertion, but VectorDBBench currently supports only single-process insertion. In our tests on a 16c64g Milvus standalone with the HNSW index type, the OpenAI 500M 1536-dim dataset takes less than 2h (including data insertion, index build, and optimization).

anrahman4 commented 8 months ago

I would also prefer the ability to do parallel insertion, especially for testing the write performance of underlying storage systems, and of course to save time.

xiaofan-luan commented 8 months ago

@alwayslove2013 Maybe we can add an async interface to pymilvus?

alwayslove2013 commented 8 months ago

@anrahman4 We do not currently provide concurrent insertion. However, you can increase the value of config.NUM_PER_BATCH and implement concurrent insertion within the "insert_embedding" function of the client you want to test.
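A minimal sketch of that idea, assuming the client's insert call can be wrapped in a callable that takes one batch and returns the number of rows written. The names `concurrent_insert` and `insert_batch`, and the `batch_size` and `max_workers` parameters, are hypothetical and not part of VectorDBBench; whether the underlying client connection is safe to share across threads also needs to be checked for the database being tested.

```python
from concurrent.futures import ThreadPoolExecutor


def concurrent_insert(embeddings, metadata, insert_batch, batch_size=1000, max_workers=4):
    """Split the data into batches and insert them from several threads.

    insert_batch is a hypothetical callable (batch_embeddings, batch_metadata) -> int
    that wraps the client's real insert call and returns the number of rows written.
    """
    if len(embeddings) != len(metadata):
        raise ValueError("embeddings and metadata must have the same length")

    # Slice the input into consecutive batches of at most batch_size rows.
    batches = [
        (embeddings[i:i + batch_size], metadata[i:i + batch_size])
        for i in range(0, len(embeddings), batch_size)
    ]

    # Threads are a reasonable fit because each insert mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(insert_batch, emb, meta) for emb, meta in batches]
        # result() re-raises any worker exception, so failures are not silently dropped.
        return sum(f.result() for f in futures)
```

Inside a client, `insert_batch` would be the existing single-batch insert logic, and the return value is the total number of rows reported as written.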

By the way, in our previous tests, we found that for most vector databases, such as ZillizCloud and Pinecone, the main bottleneck lies in their ability to build indexes. Even if concurrent insertion shortens the insertion time, it will still take a long time before the data is ready for queries.

Thank you very much for your suggestion. We will consider providing the option of concurrent insertion for performance cases in future versions.

agsachin commented 8 months ago

A few more observations: we have 3 replicas each of Kafka, DataNode, and MinIO. During inserts via VectorDBBench, only one replica shows CPU and memory utilisation; the remaining pods show none.

Am I doing something wrong? As I understand it, the data should be load balanced and all replicas should show some utilisation.

agsachin commented 8 months ago

> @agsachin Milvus supports parallel insertion, but VectorDBBench currently supports only single-process insertion. In our tests on a 16c64g Milvus standalone with the HNSW index type, the OpenAI 500M 1536-dim dataset takes less than 2h (including data insertion, index build, and optimization).

Let's assume we want to reduce this to around 30 minutes. Can we achieve this by scaling up the cluster? If yes, what do we need to keep in mind, and how much might we have to scale up? We have updated the disks to SSD and increased all replicas to 3, but we are not seeing utilisation on the other pods after scaling up.

XuanYang-cn commented 8 months ago

> A few more observations: we have 3 replicas each of Kafka, DataNode, and MinIO. During inserts via VectorDBBench, only one replica shows CPU and memory utilisation; the remaining pods show none.
>
> Am I doing something wrong? As I understand it, the data should be load balanced and all replicas should show some utilisation.

@agsachin

Scaling up DataNode/Kafka/MinIO won't speed up insertion. Insertions are load balanced by "shard", and the collection "VectorDBBench" that we initialize in vdbbench has only one shard. So with one shard in the collection, only one DataNode does the work; the others won't help.
Scaling up IndexNode will definitely speed up build_index.
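
To illustrate why extra replicas stay idle with a single shard, here is a toy model of shard-based routing. It is a simplification rather than Milvus's actual routing code: `route_to_shards` is a hypothetical helper, and `pk % num_shards` stands in for the real hash of the primary key.

```python
def route_to_shards(primary_keys, num_shards):
    """Count how many rows land on each shard under a toy routing rule.

    Assumption: pk % num_shards stands in for the real hash-based routing.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")

    # Start every shard at zero so idle shards are visible in the result.
    counts = {shard: 0 for shard in range(num_shards)}
    for pk in primary_keys:
        counts[pk % num_shards] += 1
    return counts


# One shard: every row goes to shard 0, so only one worker has anything to do.
single = route_to_shards(range(9), num_shards=1)

# Three shards: rows are spread out, so three workers can share the load.
spread = route_to_shards(range(9), num_shards=3)
```

In this model, adding workers only helps once the collection has more than one shard, which matches the behaviour described above.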