@iurii-stepanov Hi,
It seems you're using the same storage, so there's no need to set crossStorage=true. That option routes the data through the service instead of using the Copy API directly, which is much slower.
The restore process itself appears to be fine, except for a timeout during bulk insert. This could be due to the storage server being in a busy state. The default restore parallelism isn't very high. You might want to try setting backup.parallelism.restoreCollection=1 and then retry the operation.
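For reference, here is a minimal sketch of where these settings could live in milvus-backup's backup.yaml. The key placement is an assumption based on the parameters named above and may differ between milvus-backup releases, so check it against the template shipped with your version:

```yaml
# backup.yaml (sketch only; verify key names against your milvus-backup release)
minio:
  # source and backup storage are the same, so the Copy API can be used directly
  crossStorage: false          # assumption: crossStorage sits under the minio section
backup:
  parallelism:
    restoreCollection: 1       # lower restore parallelism, as suggested above
```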
Hello @wayblink.
Thank you for your answer, but unfortunately your recommendations did not help.
Please look at the new logs and config. I made a backup of another collection of 2,837,762 vectors and then tried to restore it twice; the results are in the logs. Both attempts failed: only about 1 million and 600k vectors were inserted.
I also tried changing the copydata=1 option and restored the collection three times: once successfully, twice unsuccessfully, with about 1.8 million and 1.1 million vectors inserted.
- Can I control the S3 response timeout in milvus-backup?
- I figured out that I can control the timeout with minio.requestTimeoutMs at the database level (see the sketch below), but this option is only available in 2.4.x and I am using 2.3.1. Do you think upgrading to 2.4.9 and increasing this parameter would help? Is there any way to use this timeout in 2.3.1?
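For context on the second question: minio.requestTimeoutMs is set on the Milvus side in milvus.yaml (or the equivalent Helm/operator values). A sketch assuming the 2.4.x layout, since the parameter is not exposed in 2.3.1; the value shown is only an example:

```yaml
# milvus.yaml (Milvus 2.4.x sketch; this key is not available in 2.3.1)
minio:
  requestTimeoutMs: 60000   # example value; increase it if object storage responds slowly
```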
It seems the error happens on the Milvus side. Could you please provide the Milvus logs?
2.3.1 is not recommended, as it is an early version of the 2.3.x line; the current release is 2.3.21.
v2.4.10 or v2.3.21 is recommended. You can check whether the parameter is supported in v2.3.21.
We figured out what the problem was: the NAT gateway provided by the cloud provider was causing the timeouts. The problem was solved by creating our own NAT instance. Thank you.
Current Behavior
I am backing up and then restoring a collection of 3 million vectors. The restore process terminates after inserting just over 1 million vectors, with various error messages about S3 timeouts:
[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to list insert logs with root path milvus-backup/backup_2024_08_26_collection_name_test/binlogs/insert_log/450912724765624139/450912724765624140/450912724763393713/
[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to open insert log milvus-backup/backup_2024_08_26_collection_name_test/binlogs/insert_log/450912724765624139/450912724765624140/450912724763394002/450912724763394002/106/450912724763394180
[ERROR] [core/backup_impl_restore_backup.go:792] ["fail or timeout to bulk insert"] [error="bulk insert fail, info: failed to flush block data for shard id 0 to partition 451769618979843979, error: failed to save binlogs, shard id 0, segment id 451769618979845879, channel 'by-dev-rootcoord-dml_9_451769618979843978v0'
If I back up and restore a small collection of 100k vectors, the process completes correctly.
Expected Behavior
The restore process completes correctly.
Steps To Reproduce
Environment
Anything else?
Log Config