zilliztech / milvus-backup

Backup and restore tool for Milvus
Apache License 2.0

[Feature]: Ignore Object Store data while generating backup dump #201

Closed: ganderaj closed this issue 11 months ago

ganderaj commented 1 year ago

Is your feature request related to a problem? Please describe.

We use Milvus on AWS EKS, with etcd as our metadata store and AWS S3 as our object store.

Our DB is expected to receive updates almost every minute, so we want to take backups at regular intervals (say, hourly). At the moment, it takes 50m to generate a backup, and we anticipate our DB growing 15x in the coming weeks.

Taking backups at regular intervals would be a very time-consuming process. Since AWS offers eleven 9s of durability for S3 data, we would like these backups to exclude object store data and contain only metadata. This assumes that excluding the object store from the backup would speed up the process.

Describe the solution you'd like.

Provide a CLI flag for the milvus-backup create operation to exclude object store data from the backup dump. Instead, the backup could hold pointers to the S3 locations rather than a copy of the data (sketched below).
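For illustration only, the requested behaviour might look like the hypothetical invocation below; the `--meta-only` flag does not exist in milvus-backup and is purely a sketch of the request:

```shell
# Hypothetical flag (not implemented): back up only the etcd metadata and
# record pointers to the existing S3 objects instead of copying them.
./milvus-backup create -n hourly_backup --meta-only
```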

Describe an alternate solution.

No response

Anything else? (Additional Context)

What can we do to speed up the backup process using the milvus-backup tool? Is there currently any way to improve parallelism? If not, I request that you kindly consider improving backup speed.

zhuwenxing commented 1 year ago

@ganderaj

Thanks for your feedback. How much data do you have? The backup time mainly depends on the amount of data and the speed of copying files within S3.

And if only metadata is backed up, we cannot restore from it, so what's the point of backing up only the metadata from your perspective?

For the speed of backup, there is also an enhancement tracked at https://github.com/zilliztech/milvus-backup/issues/145. @wayblink may be working on it.

ganderaj commented 1 year ago

"How much data do you have?"

We have an application (in beta release, 100 users) on our PRD Milvus that has generated a mere 2,000 vectors, and backing up this collection takes over 35m. The vector count will spike to more than a million once the application progresses to general release (40K users). Assuming backup time is directly proportional to vector count, imagine how long it would take to back up a collection with a million vectors or more.

"And if only metadata is backed up, we can not restore from it, so what's the point of backing up metadata from your perspective?"

As I understand it, the milvus-backup tool copies both the object store and the metadata store while generating the backup dump. Assuming the object store copy/read/write is what slows the process down, why can't we skip it when producing the backup dump?

AWS S3 generally provides eleven 9s of durability, so a backup of that data store is redundant. Instead, the backup dump could simply record the S3 URIs (paths) rather than actually copying the data. A restore operation could then refer to these S3 URIs and fetch the files when triggered on the same or a different cluster (see the sketch below). Please share your thoughts.
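To make the idea concrete, a purely illustrative sketch of a metadata-only backup record that stores pointers instead of copies; this is not an existing milvus-backup format, and every name and field here is hypothetical:

```python
# Illustrative only: a metadata-only manifest that records S3 object prefixes
# instead of copying the objects themselves. All names are hypothetical.
backup_manifest = {
    "backup_name": "hourly_backup_20231011",
    "collections": [
        {
            "name": "documents",
            "segments": [
                {
                    "segment_id": 445566,
                    # Pointers to the live objects, not copies.
                    "insert_logs": "s3://my-milvus-bucket/files/insert_log/445566/",
                    "delta_logs": "s3://my-milvus-bucket/files/delta_log/445566/",
                },
            ],
        },
    ],
}
```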

"For the speed of backup, there is also an enhancement tracked at https://github.com/zilliztech/milvus-backup/issues/145"

I'm not sure whether this enhancement is relevant to our use case. We haven't enabled MinIO on our deployment; we use AWS S3 (externalS3). The tool traverses all collections and vectors sequentially. Is there a way we can parallelize this operation?

wayblink commented 1 year ago

@ganderaj Hi, skipping the object store during backup is an interesting idea. The main concern is that Milvus has compaction/GC logic, so objects in storage get updated and deleted over time, and a backup that only points at them would become corrupted. I think it can work only if you disable Milvus GC (see the sketch below); however, without GC the storage usage of Milvus may grow several times larger. You can try this plan, and contributions are welcome.
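For reference, a sketch of the kind of setting this would involve; the key below is taken from recent milvus.yaml layouts and should be verified against your own deployment or Helm values before relying on it:

```yaml
# Sketch only: stop dataCoord garbage collection from deleting dropped or
# compacted segment files in S3, so that backup pointers to them stay valid.
dataCoord:
  enableGarbageCollection: false
```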

I am confused about your backup speed: 2,000 vectors take 35min? That seems unreasonable. We haven't found copy speed within S3 to be a bottleneck in our cloud product; we backed up 100M 768-dim vectors in 30min in a test case last week.

ganderaj commented 1 year ago

I must admit that the benchmark you shared gave us hope about Milvus performance.

I have conducted a test for your review. My attempt to back up a database with 271 collections amounting to 1,009 vectors took ~46min, yes 46. Our DB size is nowhere near yours, and yet the backup is extremely slow.

Going by the timestamps in the backup log file (attached), you'll find that in a few cases the backup process spent over 15m at ["GetPersistentSegmentInfo before flush from milvus"].

Please share your readout of the logs and help us reach the published performance.

Thanks.

backup_chatgpt_prod_2310110951.txt

wayblink commented 1 year ago

@ganderaj Hi, I found that it spent about 35min waiting for flush. We should look into the Milvus logs to see why it takes so long. What is your Milvus version?

wayblink commented 1 year ago

The latest code supports a -f option to skip flush during backup. It ignores data that is still in the message stream. Data in the message stream is written to disk every 10 minutes; flush forces that write to happen immediately. If you can tolerate losing up to 10 minutes of data, or you can ensure there are no write operations in the 10 minutes before you back up, it will save a lot of time in your scenario.
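For example, a minimal sketch of the invocation; the backup name is a placeholder, and the exact flag spelling should be confirmed with `./milvus-backup create --help` for your build:

```shell
# Skip the flush step during backup; data that is still only in the message
# stream (persisted to object storage roughly every 10 minutes) is ignored.
./milvus-backup create -n prod_backup -f
```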

ganderaj commented 1 year ago

We used the --force option as you suggested to skip the flush. However, our backup containing 1,706 collections / 5,217 partitions / 12,800 vectors took 1h 06m to complete. I sense this is not in line with your published backup benchmark of 30m for 100M vectors.

Can you suggest what we can do to improve backup performance? Backup logs attached.

backup.log.zip

wayblink commented 12 months ago

@ganderaj Hi, I found there are thousands of collections to back up. More time is spent fetching metadata through the Milvus API than copying data. Currently, milvus-backup does not support collection-level concurrent backup; I think that would largely solve your problem. It is on the roadmap but will take some days to land. Also, I'm afraid splitting data into so many collections, each quite small, is not best practice for Milvus: Milvus prefers large data blocks to build efficient ANN indexes. Maybe you can try partitions and partitionKey for your business (see the sketch below).
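For example, a minimal pymilvus sketch of the partition-key approach; field and collection names are illustrative, and it assumes a Milvus version with partition-key support (2.3+):

```python
# Sketch: one large collection with a partition-key field, instead of one
# small collection per tenant/dataset.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Rows are routed to internal partitions by this key, so many logical
    # datasets can share a single physical collection.
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="single collection, many tenants")
documents = Collection("documents", schema)
```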