@elpsycongroo
Do you know the source of this information? If it's not related to your issue, I'll delete it.
done
Sure, I have no idea about these replies.
@elpsycongroo
Wraps: (2) no binlog to import, input=[paths:\"backup/mybackup/binlogs/insert_log/451699909127432302/451699909127432303/451699909127432419/\" paths:\"backup/mybackup/binlogs/delta_log/451699909127432302/451699909127432303/451699909127432419/\" ]: invalid parameter
Milvus returns this error because the log path is empty. Could you check these two paths and see whether there are files in them?
Yes, they are empty.
As I said at the beginning, the file path milvus-backup generated is backup/mybackup/binlogs/delta_log/451699909127432302/451699909127432303/451699909127432419/451699909127432419/451699909126213016, with one extra 451699909127432419, so when it tries to restore there is only a folder but no files. However, the original binlog file path in Milvus is by-dev/delta_log/451699909127432302/451699909127432303/451699909127432419/451699909126213016. I wonder why milvus-backup generates a different path and causes this problem?
Original files:
Backup files:
These show that the file paths are different.
@elpsycongroo This behavior is intentional, for historical reasons.
About a year ago, the bulkInsert operation had limited parallelism capabilities and could only process a single partition path at a time during restoration. To manage large partitions, we had to split a partition into multiple groups. As a result, we structured the backup paths as {collectionID}/{partitionID}/{groupID}/{segmentID}, with each group treated as a 'pseudo-partition.'
In recent versions of Milvus, bulkInsert has been significantly improved to support high levels of parallelism, allowing us to invoke bulkInsert directly by segment. For backward compatibility, we retained the original path format, but now we use the segmentID as the groupID. Therefore, the duplicated segmentID you noticed is actually the groupID in backup logic.
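As an illustration only (this is not the actual milvus-backup code, and the function and variable names are hypothetical), here is a minimal Go sketch of how a path laid out as {collectionID}/{partitionID}/{groupID}/{segmentID} with the groupID set to the segmentID ends up repeating the same ID:

package main

import "fmt"

// backupBinlogDir is a hypothetical helper, not milvus-backup's real code.
func backupBinlogDir(backupRoot, logType string, collectionID, partitionID, segmentID int64) string {
	// For backward compatibility the group ID is simply the segment ID,
	// which is why the same ID appears twice in the backup path.
	groupID := segmentID
	return fmt.Sprintf("%s/binlogs/%s/%d/%d/%d/%d/",
		backupRoot, logType, collectionID, partitionID, groupID, segmentID)
}

func main() {
	// Prints: backup/mybackup/binlogs/delta_log/451699909127432302/451699909127432303/451699909127432419/451699909127432419/
	fmt.Println(backupBinlogDir("backup/mybackup", "delta_log",
		451699909127432302, 451699909127432303, 451699909127432419))
}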
@wayblink
Thank you for your reply! But based on the logs, during the restore the process is still searching paths that do not contain the duplicated segment ID. This causes the restore to fail because there are no binlog files at the specified path, which I don't think is normal.
So how can I complete the recovery? If this is intended behavior, did I do something wrong? I look forward to and appreciate your further assistance.
@elpsycongroo Yes, as you said, it is not normal. Please set log.level to debug and retry the backup & restore. Upload the logs and I will try to find the problem.
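For reference, a minimal sketch of where that log level lives in milvus-backup's backup.yaml (log.level is the key named above; the default value noted in the comment is an assumption and may differ by version):

log:
  level: debug   # set to debug for this retry; the usual default is info (assumption)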
@wayblink
./milvus-backup create -n mybackup -d test
create.log
./milvus-backup restore --restore_index -n mybackup -s _recover
restore.log
Some information is masked.
@elpsycongroo It is weird. I have checked the logs and the milvus-backup process works as expected. I also tested on the latest master branch and on v2.4.5, and neither reproduced the issue.
Can you check the paths backup/mybackup/binlogs/insert_log/451699909127432302/451699909127432303/451699909127432419/ and backup/mybackup/binlogs/delta_log/451699909127432302/451699909127432303/451699909127432419/?
And let's take a look at milvus.log; I'm not sure, but there may be some clue there.
@wayblink I've tried again and it still failed... This time I dropped all collections in the database named 'test' and re-created one collection. There is no delta_log, so I can only provide the insert_log path: backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/
Files' detail:
by-dev/insert_log/451699909139731630/451699909139731631/451699909139731920
Files' detail:
Logs: create.log restore.log
The error message 'no binlog to import' is returned by Milvus, in milvus/internal/datacoord/services.go L1703:
files = lo.Filter(files, func(file *internalpb.ImportFile, _ int) bool {
	return len(file.GetPaths()) > 0
})
if len(files) == 0 {
	resp.Status = merr.Status(merr.WrapErrParameterInvalidMsg(fmt.Sprintf("no binlog to import, input=%s", in.GetFiles())))
	return resp, nil
}
while the milvus-backup log shows:
[2024/09/02 10:08:59.589 +08:00] [ERROR] [core/backup_impl_restore_backup.go:782] ["fail to bulk insert"] [db=test] [collectionName=test_recover] [partitionName=_default] [files="[backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/,]"] [error="no binlog to import, input=[paths:\"backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/\" paths:\"\" ]
So it seems that Milvus cannot find valid binlog files at the given path.
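As a side note, here is a self-contained sketch of what that filter does, with a hypothetical importFile type standing in for internalpb.ImportFile (which is internal to Milvus): an entry is only dropped when its entire path list is empty.

package main

import (
	"fmt"

	"github.com/samber/lo"
)

// importFile is a local stand-in for internalpb.ImportFile, for illustration only.
type importFile struct{ Paths []string }

func (f *importFile) GetPaths() []string { return f.Paths }

func main() {
	// Mirrors the request seen in the datacoord log below: one real prefix plus an empty string.
	files := []*importFile{
		{Paths: []string{"backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/", ""}},
	}
	// Same predicate as the quoted snippet: keep entries whose path list is non-empty.
	files = lo.Filter(files, func(file *importFile, _ int) bool {
		return len(file.GetPaths()) > 0
	})
	fmt.Println("files after filter:", len(files)) // prints 1: the entry with a non-empty list is kept
}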
Related datacoord pod log:
[2024/09/02 02:54:56.688 +00:00] [INFO] [datacoord/services.go:1670] ["receive import request"] [collection=451699909139758789] [partitions="[451699909139758790]"] [channels="[by-dev-rootcoord-dml_12_451699909139758789v0]"] [files="[{\"paths\":[\"backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/\",\"\"]}]"]
[2024/09/02 02:54:56.688 +00:00] [INFO] [storage/remote_chunk_manager.go:317] ["start walk through objects"] [prefix=backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/] [recursive=false]
[2024/09/02 02:54:56.707 +00:00] [INFO] [storage/remote_chunk_manager.go:324] ["finish walk through objects"] [prefix=backup/mybackup/binlogs/insert_log/451699909139731630/451699909139731631/451699909139731920/] [recursive=false]
[2024/09/02 02:54:56.707 +00:00] [INFO] [storage/remote_chunk_manager.go:317] ["start walk through objects"] [prefix=] [recursive=false]
[2024/09/02 02:54:56.721 +00:00] [INFO] [storage/remote_chunk_manager.go:324] ["finish walk through objects"] [prefix=] [recursive=false]
[2024/09/02 02:54:56.934 +00:00] [INFO] [datacoord/channel_manager_v2.go:597] ["Check ToWatch/ToRelease channel operations progress"] ["channel count"=1] ["channel names"="[by-dev-rootcoord-dml_12_451699909139758789v0]"]
[2024/09/02 02:54:56.934 +00:00] [INFO] [datacoord/channel_manager_v2.go:669] ["Got channel operation progress"] [opID=451699909139758793] [nodeID=38] ["check operation"=ToWatch] [channel=by-dev-rootcoord-dml_12_451699909139758789v0] ["got state"=WatchSuccess] [progress=100]
[2024/09/02 02:54:56.935 +00:00] [INFO] [datacoord/channel_manager_v2.go:630] ["Finish to Check ToWatch/ToRelease channel operations progress"] ["channel count"=1] ["channel names"="[by-dev-rootcoord-dml_12_451699909139758789v0]"]
[2024/09/02 02:54:57.944 +00:00] [DEBUG] [datacoord/import_scheduler.go:167] ["peek slots done"] [nodeSlots="{\"38\":16}"]
[2024/09/02 02:54:59.755 +00:00] [DEBUG] [config/etcd_source.go:154] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[10.128.16.9:2379,10.128.16.13:2379,10.128.16.14:2379]"]
[2024/09/02 02:54:59.943 +00:00] [DEBUG] [datacoord/import_scheduler.go:167] ["peek slots done"] [nodeSlots="{\"38\":16}"]
[2024/09/02 02:55:00.396 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=b1b3fdecc950a93c49b8f2fcde79339b] [collectionID=451699909136643037]
[2024/09/02 02:55:00.397 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=f017ae80cfa4533dd8a603b499099f4c] [collectionID=451699909136643037] [indexName=]
[2024/09/02 02:55:00.398 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=388d143c54155066299c2bd4535d6a75] [collectionID=451699909136664853]
[2024/09/02 02:55:00.398 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=3bfeff86ef30d240a5ed60a85031ca9a] [collectionID=451699909136664853] [indexName=]
[2024/09/02 02:55:00.399 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=07204e64fc5c2436264b852071c8e7bc] [collectionID=451699909136643037]
[2024/09/02 02:55:00.399 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=c19544676ac80d37ea23b06a22918fe0] [collectionID=451699909136664853]
[2024/09/02 02:55:00.400 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=cdce612f8c7e6df870f981a488148363] [collectionID=451699909136777573]
[2024/09/02 02:55:00.400 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=9fc7f7f5a68e577897f90d1292f93afd] [collectionID=451699909138806319]
[2024/09/02 02:55:00.400 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=8fa843557b3bff59d5f1d54342fc424f] [collectionID=451699909139731630]
[2024/09/02 02:55:00.400 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=c241afc6c830966f3dfd12b25e9ee553] [collectionID=451699909136664853] [indexName=]
[2024/09/02 02:55:00.402 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=75f52fa4ac9faa89a2efc4c2ed7b7499] [collectionID=451699909136664853] [indexName=]
[2024/09/02 02:55:00.403 +00:00] [INFO] [datacoord/index_service.go:924] ["List index success"] [traceID=c705411db536a344700984e841492dca] [collectionID=451699909136777573]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=cf76886544262fdefdcde5296c98cafc] [collectionID=451699909136643037] [partitionIDs="[]"]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=451699909136643037] [channel=by-dev-rootcoord-dml_9_451699909136643037v0] [numOfSegments=1] ["indexed segment"=1]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_9_451699909136643037v0] [posTs=452262792944156674] [posTime=2024/09/02 02:53:59.588 +00:00]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=cf76886544262fdefdcde5296c98cafc] [collectionID=451699909136643037] [partitionIDs="[]"] [channel=by-dev-rootcoord-dml_9_451699909136643037v0] ["# of unflushed segments"=0] ["# of flushed segments"=1] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=99ec7fca1d8f7efc6351a2a9277d113a] [collectionID=451699909139731630] [partitionIDs="[]"]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=451699909139731630] [channel=by-dev-rootcoord-dml_0_451699909139731630v0] [numOfSegments=1] ["indexed segment"=0]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_0_451699909139731630v0] [posTs=452262801332764674] [posTime=2024/09/02 02:54:31.588 +00:00]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=99ec7fca1d8f7efc6351a2a9277d113a] [collectionID=451699909139731630] [partitionIDs="[]"] [channel=by-dev-rootcoord-dml_0_451699909139731630v0] ["# of unflushed segments"=0] ["# of flushed segments"=1] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=6e6239c6056500741e9b57ec75f757e3] [collectionID=451699909138806319] [partitionIDs="[]"]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=451699909138806319] [channel=by-dev-rootcoord-dml_15_451699909138806319v0] [numOfSegments=1] ["indexed segment"=1]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_15_451699909138806319v0] [posTs=452262792944156674] [posTime=2024/09/02 02:53:59.588 +00:00]
[2024/09/02 02:55:00.404 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=6e6239c6056500741e9b57ec75f757e3] [collectionID=451699909138806319] [partitionIDs="[]"] [channel=by-dev-rootcoord-dml_15_451699909138806319v0] ["# of unflushed segments"=0] ["# of flushed segments"=1] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=9e301463d9f03fe2a32b4af198d2f32c] [collectionID=451699909136664853] [partitionIDs="[]"]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=451699909136664853] [channel=by-dev-rootcoord-dml_10_451699909136664853v0] [numOfSegments=3] ["indexed segment"=2]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_10_451699909136664853v0] [posTs=452262792944156674] [posTime=2024/09/02 02:53:59.588 +00:00]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=9e301463d9f03fe2a32b4af198d2f32c] [collectionID=451699909136664853] [partitionIDs="[]"] [channel=by-dev-rootcoord-dml_10_451699909136664853v0] ["# of unflushed segments"=0] ["# of flushed segments"=1] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=2]
[2024/09/02 02:55:00.405 +00:00] [DEBUG] [datacoord/index_service.go:892] ["GetIndexInfos successfully"] [traceID=9ce825dd76beadd3a592429c38ab7f8c] [collectionID=451699909136777573] [indexName=]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/services.go:820] ["get recovery info request received"] [traceID=87525de739cc51f178aa4d53a4b22053] [collectionID=451699909136777573] [partitionIDs="[]"]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/handler.go:117] [GetQueryVChanPositions] [collectionID=451699909136777573] [channel=by-dev-rootcoord-dml_11_451699909136777573v0] [numOfSegments=1] ["indexed segment"=1]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/handler.go:302] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_11_451699909136777573v0] [posTs=452262792944156674] [posTime=2024/09/02 02:53:59.588 +00:00]
[2024/09/02 02:55:00.405 +00:00] [INFO] [datacoord/services.go:835] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=87525de739cc51f178aa4d53a4b22053] [collectionID=451699909136777573] [partitionIDs="[]"] [channel=by-dev-rootcoord-dml_11_451699909136777573v0] ["# of unflushed segments"=0] ["# of flushed segments"=1] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0]
@elpsycongroo I'm out of ideas. Can you edit and compile the Milvus code in your environment? For me, the only way forward is to add some logging in Milvus and find the root cause.
@wayblink Good news: I tried in my test cluster and it finished with no errors. But it still cannot restore in the prod cluster. The only related milvus-backup log is:
[core/backup_impl_restore_backup.go:782] ["fail to bulk insert"] [db=test] [collectionName=test_recover] [partitionName=_default] [files="[backup/mybackup/binlogs/insert_log/451699909141706930/451699909141706931/451699909141707048/,]"] [error="no binlog to import, input=[paths:\"backup/mybackup/binlogs/insert_log/451699909141706930/451699909141706931/451699909141707048/\" paths:\"\" ]: invalid parameter"]
I've found that the 'no binlog to import' error comes from Milvus, but I couldn't find where 'invalid parameter' comes from. Do you have any idea about that, or about what difference between my two clusters could cause this problem?
@elpsycongroo Are the prod and test environments on the same version? Is your prod Milvus a public release or built by yourself? And the commit ID?
@wayblink The Milvus version is the same because we use the same image. It's a public release, v2.4.5, commit 60695bd. The difference between the prod and test environments is the Kafka configuration and/or version; the S3 and etcd versions are all the same.
@elpsycongroo Sorry, I have no idea for now. If I were you, the only thing I could do is add some logging to the source code, rebuild, and retry.
@wayblink OK, thank you for your help. Another question (not related to this issue): my prod env's S3 uses SSL while the test env's does not. How can I configure milvus-backup if I want to copy data from the prod S3 to the test S3, or is that not supported for now? Because there's no option like 'backupUseSSL'.
@elpsycongroo Actually it is supported: minio.backupUseSSL. I just didn't add it to backup.yaml. You can refer to the code:
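For illustration only (this is separate from the code referenced above), a minimal backup.yaml sketch: minio.backupUseSSL is the option mentioned here; the other values are placeholders, and the exact set of backup-side keys (separate endpoint, credentials, and so on) may differ between milvus-backup versions, so verify against your version's default backup.yaml.

minio:
  # source storage, i.e. where Milvus keeps its data (placeholder values)
  address: s3.prod.example.com
  port: 443
  useSSL: true
  bucketName: bucket-name
  rootPath: by-dev
  # backup storage; backupUseSSL is the option referred to above, while the
  # other backup* keys and any separate backup endpoint/credential settings
  # depend on the milvus-backup version
  backupBucketName: bucket-name
  backupRootPath: backup
  backupUseSSL: false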
Current Behavior
The Milvus cluster is deployed on k8s, and I run the backup tool for it. The backup finishes successfully, but there are error logs when I restore the backup, and I find that the backup file paths have one extra level. For example,
by-dev/delta_log/451699909127432302/451699909127432303/451699909127432419/451699909126213016
backup/mybackup/binlogs/delta_log/451699909127432302/451699909127432303/451699909127432419/451699909127432419/451699909126213016
It seems that there is one extra 451699909127432419 in the backup path, so the correct files cannot be found. The restore log shows that:
Run backup check, the status is:
Expected Behavior
The backup should be restored successfully, and the paths should be correct.
Steps To Reproduce
Environment
Milvus version: 2.4.5
Milvus backup version: 0.4.21
backup.yaml:
Anything else?
Why did the backup file path change? Full log of the backup creation (I masked the bucket name as 'bucket-name'): ./milvus-backup create -n mybackup
Full log of restoring: ./milvus-backup restore --restore_index -n mybackup -s _recover