zilliztech / milvus-backup

Backup and restore tool for Milvus
Apache License 2.0

[Bug]: Restoring collection error: fail or timeout to bulk insert #413

Closed. iurii-stepanov closed this issue 1 month ago.

iurii-stepanov commented 2 months ago

Current Behavior

I am backing up and then restoring a collection of 3 million vectors. The restore process terminates after inserting just over 1 million vectors, with various error messages about S3 timeouts:

[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to list insert logs with root path milvus-backup/backup_2024_08_26_collection_name_test/binlogs/insert_log/450912724765624139/450912724765624140/450912724763393713/

[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to open insert log milvus-backup/backup_2024_08_26_collection_name_test/binlogs/insert_log/450912724765624139/450912724765624140/450912724763394002/450912724763394002/106/450912724763394180

[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to flush block data for shard id 0 to partition 451769618979843979, error: failed to save binlogs, shard id 0, segment id 451769618979845879, channel 'by-dev-rootcoord-dml_9_451769618979843978v0'

If I back up and restore a small collection of 100k vectors, the process completes correctly.

Expected Behavior

The restore process completes correctly.

Steps To Reproduce

Create a backup of the collection:
milvus-backup create -n backup -a '{"default":["collection"]}' --config configs/backup.yaml

Restore the collection:
milvus-backup restore -n backup -a '{"default":["collection"]}' -s _recovered --config configs/backup.yaml

I tried different parameters and versions (see the config sketch after this list):
- milvus-backup version: 0.4.19, 0.4.21
- maxSegmentGroupSize: 2G, 4G
- copydata: 128, 32, 4
- crossStorage: false, true
- storageType: aws, s3
In each case the number of restored vectors is similar but not identical: 1,120,902, 1,019,620, or 1,224,002 out of a full collection of 2,830,158.
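
For reference, here is a minimal sketch of where these knobs sit in backup.yaml. The key names and section layout are assumed from the milvus-backup sample config and may differ between versions:

    # backup.yaml (sketch; section layout assumed from the sample config)
    minio:
      storageType: "s3"        # tried: aws, s3
      crossStorage: false      # tried: false, true
    backup:
      maxSegmentGroupSize: 2G  # tried: 2G, 4G
      parallelism:
        copydata: 128          # tried: 128, 32, 4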

Environment

Milvus version: 2.3.1
milvus-backup version: 0.4.21
Milvus in cluster mode with external S3, etcd, and Kafka

Anything else?

Attachments: Log, Config

wayblink commented 2 months ago

@iurii-stepanov Hi,

It seems you're using the same storage on both sides, so there's no need to set crossStorage=true. That option routes the data through the milvus-backup service instead of using the storage Copy API directly, which is much slower.

The restore process itself appears to be fine, except for a timeout during bulkinsert. This could be because the storage server is busy. The default restore parallelism isn't very high; you might try setting backup.parallelism.restoreCollection=1 and retrying the operation.
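
For clarity, a minimal sketch of where that setting would go in backup.yaml (section layout assumed from the milvus-backup sample config):

    # backup.yaml (sketch)
    backup:
      parallelism:
        restoreCollection: 1   # serialize collection restores to reduce load on the storage server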

iurii-stepanov commented 2 months ago

Hello @wayblink.

Thank you for your answer. But unfortunately, your recommendations did not help.

Please look at the new logs and config. I made a backup of another collection, of 2,837,762 vectors, and then tried to restore it twice; the results are in the logs. Both attempts failed: only about 1 million and 600k vectors were inserted.

I even tried setting copydata=1 and restored the collection three times: once successfully, twice unsuccessfully, with about 1.8 million and 1.1 million vectors inserted.

  1. Can I control the S3 response timeout in milvus-backup?

  2. I found that I can control the timeout via minio.requestTimeoutMs at the database level, but this option is only available in 2.4.x, and I am using 2.3.1. Do you think upgrading to 2.4.9 and increasing that parameter could help? Is there any way to use this timeout in 2.3.1? (See the sketch after this list.)
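
For context, the parameter in item 2 is a Milvus server setting, not a milvus-backup one. A sketch of where it sits in milvus.yaml on 2.4.x (value illustrative):

    # milvus.yaml (Milvus 2.4.x; value illustrative)
    minio:
      requestTimeoutMs: 10000  # timeout for object storage requests, in milliseconds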

wayblink commented 2 months ago


It seems the error happens on the Milvus side. Could you please provide the Milvus logs?

2.3.1 is not recommended, as it is an early release in the 2.3.x line; the latest is 2.3.21.

v2.4.10 or v2.3.21 is recommended. You can check whether the parameter is supported in v2.3.21.

iurii-stepanov commented 1 month ago

We figured out what the problem was: the NAT gateway provided by our cloud provider was causing the timeouts. Creating our own NAT instance solved the problem. Thank you.