Closed agsachin closed 5 months ago
@agsachin Milvus supports parallel insertion, but VectorDBBench currently supports only single-process insertion. In our test on a 16c64g Milvus standalone instance with the HNSW index type, the Openai 500M 1536dim dataset took less than 2h end to end (insert data + build index + optimization).
I would also prefer the ability to do parallel insertion, especially for testing the write performance of the underlying storage systems, and of course to save time.
@alwayslove2013 maybe we can add an async interface in pymilvus?
@anrahman4
We do not currently provide concurrent insertion. However, you can increase the value of config.NUM_PER_BATCH and implement concurrent insertion within the "insert_embedding" function of the client you want to test.
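For anyone who wants to try this, here is a minimal sketch of what concurrent insertion inside a client's insert function could look like. It is not VectorDBBench code: `insert_concurrently`, `insert_batch`, `chunk_size` and `max_workers` are hypothetical names, and `insert_batch` stands in for whatever thread-safe insert call your database client actually provides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def insert_concurrently(
    embeddings: Sequence[List[float]],
    ids: Sequence[int],
    insert_batch: Callable[[Sequence[List[float]], Sequence[int]], int],
    chunk_size: int = 1000,
    max_workers: int = 4,
) -> int:
    """Split one large batch into chunks and insert them from several threads.

    insert_batch is a hypothetical stand-in for the database client's own
    insert call; it must be thread-safe and return the number of rows written.
    """
    if len(embeddings) != len(ids):
        raise ValueError("embeddings and ids must have the same length")

    # Cut the incoming batch into (vectors, ids) chunks of at most chunk_size rows.
    chunks = [
        (embeddings[i:i + chunk_size], ids[i:i + chunk_size])
        for i in range(0, len(ids), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(insert_batch, vecs, pks) for vecs, pks in chunks]
        # result() re-raises any exception raised in a worker thread.
        return sum(f.result() for f in futures)
```

Threads only help if the client call spends most of its time waiting on the network; whether the server then ingests the chunks in parallel depends on the database itself.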
By the way, in our previous tests, we found that for most vector databases, such as ZillizCloud and Pinecone, the main bottleneck lies in their ability to build indexes. Even if concurrent insertion can speed up the insertion time, it will still take a long time before it is ready for query.
Thank you very much for your suggestion. We will consider providing the option of concurrent insertion for performance cases in future versions.
A few more data points we recently observed: we have 3 replicas each of kafka, datanode and minio. During insertion via VectorDBBench, only one replica shows any utilisation (CPU and memory); the remaining pods show none.
Am I doing something wrong? As I understand it, the data should be load balanced and all replicas should show some utilisation.
Regarding the under-2h result quoted above (insert data + build index + optimization): let's assume we want to bring this down to around 30 minutes. Can we achieve that by scaling up the cluster? If yes, what do we need to keep in mind, and how much might we have to scale up? We have updated the disks to SSD and increased all replicas to 3, but we are not seeing utilisation on the other pods after scaling up.
@agsachin
Scaling up DataNode/kafka/minio won't speed up insertion. Insertions are load balanced per "shard", and the collection "VectorDBBench" that we initialize in vdbbench has only one shard. With a single shard in the collection, only one datanode does the work; the others simply cannot help.
Scaling up IndexNode will definitely speed up build_index.
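To make the one-shard point concrete, here is a toy model of the routing described above. It is a deliberately simplified sketch, not Milvus's actual routing code: the modulo rules for mapping a primary key to a shard and a shard to a DataNode are assumptions chosen only to show why extra DataNodes stay idle when the collection has a single shard.

```python
from collections import Counter


def busy_datanodes(num_rows: int, num_shards: int, num_datanodes: int) -> Counter:
    """Count rows handled per DataNode in a toy shard-routing model."""
    rows_per_node = Counter()
    for pk in range(num_rows):
        shard = pk % num_shards        # toy routing: primary key -> shard
        node = shard % num_datanodes   # toy assignment: shard -> DataNode
        rows_per_node[node] += 1
    return rows_per_node


one_shard = busy_datanodes(num_rows=9000, num_shards=1, num_datanodes=3)
three_shards = busy_datanodes(num_rows=9000, num_shards=3, num_datanodes=3)

# Number of DataNodes that received any rows in each setup.
print(len(one_shard), len(three_shards))  # prints "1 3"
```

In this model, adding DataNodes changes nothing until the number of shards grows as well, which matches the idle pods reported earlier in the thread.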
When we run tests we are mostly focused on search performance, but inserting data takes up a lot of time. Can we have a parallel insert capability? It would save us a lot of time during insertion. We are testing the performance of the Milvus cluster with 5M records.