Open: ian-lss opened this issue 7 months ago
@ian-lss could you provide milvus-operator's log? info & warn level logs can also provide much information.
hi, @haorenfsa
Sorry, but there are two issues here. One is the issue at https://github.com/zilliztech/attu/issues/437, and the other is that the upgrade only applies up to the index node. For the part where the upgrade doesn't happen, there are some things I don't understand yet, so I'm adding more detailed logging to the code to clearly identify and report the problem. I'll share the logs after testing, so please wait a little longer.
and the other is an issue where the upgrade only applies up to the index node.
I believe the upgrade process is quite well tested. I'll wait for your logs to see what's going on there.
hi, @haorenfsa
I added log prints to the milvus-operator source and tested it.
In conclusion, I confirmed that it is operating normally.
I'm very sorry for the confusion.
The reason I thought it was not working as I reported in the issue was a hidden issue with a terminating mixcoord pod.
I have a question:
TerminationGracePeriodSeconds is applied to each component in the updateSomeFieldsOnlyWhenRolling function, via
template.Spec.TerminationGracePeriodSeconds = int64Ptr(int64(oneMonthSeconds))
After the upgrade, the pods of every component except mixcoord terminate successfully, but the mixcoord pods do not terminate and remain in the Terminating state for an entire month.
Is this situation intended?
Therefore, in the current situation where Milvus upgrades are frequent (a new version is released within a month), we have no choice but to manually force delete the terminating mixcoord pod and then upgrade.
Is this operating procedure (manually force deleting the terminating mixcoord pod before upgrading) intended?
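For reference, a minimal sketch of what I understand the operator does here; the package name and surrounding types are simplified guesses on my side, only TerminationGracePeriodSeconds and oneMonthSeconds come from the actual code:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
)

// roughly one month, so Kubernetes never sends SIGKILL on its own during a rolling upgrade
const oneMonthSeconds = 30 * 24 * 3600

func int64Ptr(i int64) *int64 { return &i }

// updateSomeFieldsOnlyWhenRolling mirrors the line quoted above: while a rolling
// upgrade is in progress, every component's pod template gets the one-month
// termination grace period.
func updateSomeFieldsOnlyWhenRolling(template *corev1.PodTemplateSpec) {
	template.Spec.TerminationGracePeriodSeconds = int64Ptr(int64(oneMonthSeconds))
}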
Below is a part of the log from the mixcoord in terminating state.
[2024/03/20 02:55:28.039 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: rootcoord is in STANDBY ..."]
[2024/03/20 02:55:28.836 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: querycoord is in STANDBY ..."]
[2024/03/20 02:55:28.836 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: datacoord is in STANDBY ..."]
[2024/03/20 02:55:32.347 +00:00] [DEBUG] [config/refresher.go:70] ["etcd refreshConfigurations"] [prefix=ai/config] [endpoints="[ai-etcd.ai-milvus:2379]"]
but the mixcoord component does not terminate and remains in the terminating state for an entire month. Is this situation intended?
No, we expect the old mixcoord to change to standby, and then exit by itself. If a mixcoord hangs in its terminating state, it's likely to be a milvus issue.
Could you also provide us the logs from before it starts terminating? And does it only happen when upgrading from 2.3.8 to a newer release?
@weiliu1031 any known issue in 2.3.8 that makes mixcoord hang in terminating?
@haorenfsa Which logs do you mean by "the log before it's terminating"? I had to force terminate mixcoord due to a version upgrade. The problematic versions are all versions since we started using milvus-operator: 2.3.8, 2.3.10, 2.3.11, 2.3.12.
@weiliu1031 Which logs are needed if the rolling upgrade process is interrupted? Below is a summary of some logs from the time of the mixcoord hang.
To briefly summarize the analysis:
Failure case: ReconcileComponentDeployment (deployments.go) -> updateDeployment (deployments.go) -> CheckComponentHasTerminatingPod (component_condition.go) -> HasTerminatingPodByListOpts (component_condition.go) returns true -> updateDeploymentWithoutPodTemplate (deployment_updater.go) -> IsEqual(old, cur) returns true (deployments.go) -> end.
Success case: ReconcileComponentDeployment (deployments.go) -> updateDeployment (deployments.go) -> CheckComponentHasTerminatingPod (component_condition.go) -> HasTerminatingPodByListOpts (component_condition.go) returns false -> updateDeployment (deployment_updater.go) -> updateDeploymentWithoutPodTemplate (deployment_updater.go) -> updatePodTemplate (deployment_updater.go) -> updateMilvusContainer (deployment_updater.go) -> RollingUpdateImageDependencyReady (deployment_updater.go) -> GetDependencies (deployment_updater.go) -> IsImageUpdated (components.go) -> GetState (milvus_types.go) returns true -> container.Image = mergedComSpec.Image -> IsEqual(currentTemplate, template) -> updateSomeFieldsOnlyWhenRolling -> IsEqual(old, cur) returns false (deployments.go) -> Update (k8s SDK) -> end.
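A condensed sketch of that control flow, with hypothetical signatures (the real functions in deployments.go and deployment_updater.go look different); it only illustrates why a terminating pod blocks the image update:

package controllers

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// updateDeploymentSketch condenses the two paths above: when the component has a
// terminating pod, only the non-template fields are reconciled, so the image is
// never bumped and IsEqual(old, cur) stays true ("nothing diff").
func updateDeploymentSketch(old *appsv1.Deployment, hasTerminatingPod bool, targetImage string) (needUpdate bool) {
	cur := old.DeepCopy()
	updateDeploymentWithoutPodTemplateSketch(cur)
	if !hasTerminatingPod && len(cur.Spec.Template.Spec.Containers) > 0 {
		// updatePodTemplate -> updateMilvusContainer: the image is only changed here,
		// after RollingUpdateImageDependencyReady reports the dependencies complete.
		cur.Spec.Template.Spec.Containers[0].Image = targetImage
	}
	return !reflect.DeepEqual(old.Spec, cur.Spec)
}

func updateDeploymentWithoutPodTemplateSketch(d *appsv1.Deployment) {
	// replicas, labels, annotations, etc. -- everything except the pod template
}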
2024-03-19T03:08:10.607Z INFO controller.status-syncer syncUnealthyOrUpdating end {"duration": "61.54914ms"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "mixcoord", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO controller.milvus Start Update Deployment {"name": "vs-milvus-mixcoord"}
2024-03-19T03:08:10.610Z INFO controller.milvus [MilvusReconciler.updateDeployment] component has terminating pod, skip update template {"component": "mixcoord"}
2024-03-19T03:08:10.611Z INFO controller.milvus nothing diff context {"name": "vs-milvus-mixcoord", "namespace": "milvus"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "standalone", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO controller.milvus Start Update Deployment {"name": "vs-milvus-standalone"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "standalone"}
2024-03-19T03:08:10.610Z INFO GetDependencies {"depGraph": null, "component": {"Name":"standalone","FieldName":"Standalone","DefaultPort":19530}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep count": 0, "component": {"Name":"standalone","FieldName":"Standalone","DefaultPort":19530}}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "standalone"}
2024-03-19T03:08:10.610Z INFO updatePodTemplate {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "standalone"}
2024-03-19T03:08:10.611Z INFO controller.milvus nothing diff context {"name": "vs-milvus-standalone", "namespace": "milvus"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "proxy", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO controller.milvus Start Update Deployment {"name": "vs-milvus-proxy"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "proxy"}
2024-03-19T03:08:10.610Z INFO GetDependencies {"depGraph": [{"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}], "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep count": 1, "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.611Z INFO RollingUpdateImageDependencyReady {"dep": "datanode:DataNode", "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.611Z INFO IsImageUpdated {"ComponentsDeployStatus": 6, "name": "datanode", "component": "proxy"}
2024-03-19T03:08:10.611Z INFO IsImageUpdated {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "datanode", "component": "proxy"}
2024-03-19T03:08:10.611Z INFO IsImageUpdated {"deployStatus.GetState()": "Complete", "name": "datanode", "component": "vs-milvus.milvus:19530", "component": "proxy"}
2024-03-19T03:08:10.611Z INFO updateMilvusContainer {"container.Image": "milvusdb/milvus:v2.3.10", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "proxy"}
2024-03-19T03:08:10.611Z INFO updatePodTemplate {"IsEqual template": false, "image": "milvusdb/milvus:v2.3.11", "name": "proxy"}
2024-03-19T03:08:10.611Z INFO updatePodTemplate {"updateSomeFieldsOnlyWhenRolling": "Exec", "name": "proxy"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "datanode", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO controller.milvus Start Update Deployment {"name": "vs-milvus-datanode"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "datanode"}
2024-03-19T03:08:10.610Z INFO GetDependencies {"depGraph": [{"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}], "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep count": 1, "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep": "querynode:QueryNode", "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"ComponentsDeployStatus": 6, "name": "querynode", "component": "datanode"}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "querynode", "component": "datanode"}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"deployStatus.GetState()": "Complete", "name": "querynode", "component": "vs-milvus.milvus:19530", "component": "datanode"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "datanode"}
2024-03-19T03:08:10.610Z INFO updatePodTemplate {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "datanode"}
2024-03-19T03:08:10.611Z INFO controller.milvus nothing diff context {"name": "vs-milvus-datanode", "namespace": "milvus"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "indexnode", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO controller.milvus Start Update Deployment {"name": "vs-milvus-indexnode"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "indexnode"}
2024-03-19T03:08:10.610Z INFO GetDependencies {"depGraph": [], "component": {"Name":"indexnode","FieldName":"IndexNode","DefaultPort":21121}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep count": 0, "component": {"Name":"indexnode","FieldName":"IndexNode","DefaultPort":21121}}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "indexnode"}
2024-03-19T03:08:10.611Z INFO updatePodTemplate {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "indexnode"}
2024-03-19T03:08:10.611Z INFO controller.milvus nothing diff context {"name": "vs-milvus-indexnode", "namespace": "milvus"}
2024-03-19T03:08:10.610Z INFO controller.milvus Reconcile Component {"name": "vs", "component": "querynode", "mode": "cluster"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "querynode"}
2024-03-19T03:08:10.610Z INFO GetDependencies {"depGraph": [{"Name":"mixcoord","FieldName":"MixCoord","DefaultPort":-1}], "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep count": 1, "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z INFO RollingUpdateImageDependencyReady {"dep": "mixcoord:MixCoord", "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"ComponentsDeployStatus": 6, "name": "mixcoord", "component": "querynode"}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "mixcoord", "component": "querynode"}
2024-03-19T03:08:10.610Z INFO IsImageUpdated {"deployStatus.GetState()": "Complete", "name": "mixcoord", "component": "vs-milvus.milvus:19530", "component": "querynode"}
2024-03-19T03:08:10.610Z INFO updateMilvusContainer {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "querynode"}
2024-03-19T03:08:10.610Z INFO updatePodTemplate {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "querynode"}
2024-03-19T03:08:10.611Z INFO controller.helm getting history for release vs-kafka
2024-03-19T03:08:10.612Z INFO controller.helm getting history for release vs-etcd
2024-03-19T03:08:10.612Z INFO controller.milvus Update Deployment {
"name": "vs-milvus-proxy",
"namespace": "milvus",
"diff": "--- From\n+++ To
@@ -433,9 +433,9 @@
Name: (string) (len=6) \"config\",
- Image: (string) (len=31) \"milvusdb/milvus-operator:v0.9.5\",
- Command: ([]string) (len=3) {
- (string) (len=3) \"/cp\",
- (string) (len=14) \"/run.sh,/merge\",
- (string) (len=40) \"/milvus/tools/run.sh,/milvus/tools/merge\"
- },\n- Args: ([]string) <nil>,
+ Image: (string) (len=107) \"u-docker.pkg.dev/mp-artifact-registry-aa49/dev/milvus-operator:2024-03-18_14-55_87fcc43\",
+ Command: ([]string) (len=1) {
+ (string) (len=7) \"/bin/sh\"
+ },
+ Args: ([]string) (len=1) {
+ (string) (len=8) \"/init.sh\"
+ },
WorkingDir: (string) \"\",
@@ -448,3 +448,3 @@
},
- VolumeMounts: ([]v1.VolumeMount) (len=1) {
+ VolumeMounts: ([]v1.VolumeMount) (len=2) {
(v1.VolumeMount) {
@@ -454,2 +454,10 @@
SubPath: (string) \"\",
+ MountPropagation: (*v1.MountPropagationMode)(<nil>),
+ SubPathExpr: (string) \"\"
+ },
+ (v1.VolumeMount) {
+ Name: (string) (len=13) \"milvus-config\",
+ ReadOnly: (bool) true,
+ MountPath: (string) (len=25) \"/milvus/configs/user.yaml\",
+ SubPath: (string) (len=9) \"user.yaml\",
MountPropagation: (*v1.MountPropagationMode)(<nil>),
@@ -475,3 +483,3 @@
Name: (string) (len=5) \"proxy\",
- Image: (string) (len=23) \"milvusdb/milvus:v2.3.10\",
+ Image: (string) (len=23) \"milvusdb/milvus:v2.3.11\",
Command: ([]string) <nil>,
"}
I mean the logs of the Milvus pods, not the milvus-operator's. The operator works just as expected; it's likely a bug in Milvus, so we need its logs to dig in.
Hi, @haorenfsa @weiliu1031 To obtain accurate logs for the specific range, I tried upgrading another cluster (to 2.3.12). I've attached the logs of the old version pods of each component during the rolling upgrade. Is this what you wanted? Since we configured two replicas for each component, there are two pods per component.
Additionally, I have two more questions. First, there was no problem with the upgrade from 2.3.10 to 2.3.11 (except for the mixcoord hangup), but after the upgrade to 2.3.12,
[2024/03/21 06:04:25.302 +00:00] [WARN] [broker/datacoord.go:102] ["failed to update channel checkpoint"] [channelNames="[ai-rootcoord-dml_3_447872829707677642v1,ai-rootcoord-dml_1_447872829696987074v0]"] [channelCheckpointTimes="[2024/03/21 06:04:22.246 +00:002024/03/21 06:03:48.845 +00:00]"] [error="channelCP is nil, vChannel="]
[2024/03/21 06:04:25.302 +00:00] [WARN] [datanode/channel_checkpoint_updater.go:129] ["update channel checkpoint failed"] [error="channelCP is nil, vChannel="]
this error occurred. Initially, I thought it was a problem specific to that particular cluster, but seeing it occur again after this upgrade makes me wonder if it's a problem with 2.3.12. It seems related to a Milvus issue, similar to the ticket at https://github.com/milvus-io/milvus/issues/30004. The log shows it as a warning, but is it not a critical issue for the service?
Another thing, as you can see from the attached querynode log, there's a significant difference in terminating times. One pod terminates quickly without any issues, but the other takes about 30 minutes to terminate. Why is that? Considering the physical memory usage is the same for both pods, it doesn't seem like a balancing issue, does it?
Something is wrong with the logs above. I looked at mixcoord-1.txt and mixcoord-2.txt, and they show that mixcoord-1 and mixcoord-2 are both stuck in the STANDBY state. That means the active mixcoord is neither mixcoord-1 nor mixcoord-2, or I got duplicate logs of the standby mixcoord.
If something gets stuck during the rolling upgrade, it means the old active coord/node didn't exit as expected. I need the logs of the old active coord/node, then I'll figure out what happened here.
About the two problems above:
@weiliu1031
The mixcoord-x.log file I provided is definitely the old active coord/node. That's what we're wondering about as well. According to the log, it only executes ["Get signal to exit"] [signal=terminated] at [roles/roles.go:285] and there are no further actions. The question is why it only executes the signal and doesn't proceed any further.
The speculation is that the node appears in terminating state because it has received an exit signal.
@weiliu1031
Does "additional active mixcoord log" for the querynode issue refer to the new active mixcoord log?
Both mixcoord-1 and mixcoord-2 are in Standby status, but the querynodes are executing tasks such as loadSegment, which means there exists an active mixcoord besides the standby mixcoord-1 and standby mixcoord-2.
@weiliu1031
We are operating with only 2 replicas, so there is no additional mixcoord pod.
I'm attaching additional captures of the pod screen from ArgoCD and the pod list from kubectl.
kubectl get pod -n ai-milvus
NAME READY STATUS RESTARTS AGE
ai-etcd-0 1/1 Running 0 33d
ai-etcd-1 1/1 Running 0 33d
ai-etcd-2 1/1 Running 0 33d
ai-kafka-0 1/1 Running 2 (33d ago) 33d
ai-kafka-1 1/1 Running 2 (33d ago) 33d
ai-kafka-2 1/1 Running 2 (33d ago) 33d
ai-kafka-zookeeper-0 1/1 Running 0 33d
ai-kafka-zookeeper-1 1/1 Running 0 33d
ai-milvus-attu-8f49bb85f-c57qr 1/1 Running 3 (4h58m ago) 18d
ai-milvus-datanode-748759898c-b7gh8 1/1 Running 0 4d4h
ai-milvus-datanode-748759898c-ckzh9 1/1 Running 0 4d4h
ai-milvus-indexnode-856b87865-ldh4r 1/1 Running 0 4d4h
ai-milvus-indexnode-856b87865-zgqdg 1/1 Running 0 4d4h
ai-milvus-mixcoord-5b7b68865-547sk 1/1 Terminating 0 5d22h
ai-milvus-mixcoord-5b7b68865-lgcs8 1/1 Terminating 0 5d22h
ai-milvus-mixcoord-6c7b99564d-p59ml 1/1 Running 0 4d4h
ai-milvus-mixcoord-6c7b99564d-wbwfv 1/1 Running 0 4d4h
ai-milvus-proxy-746f9fcd96-5w5k2 1/1 Running 0 4d4h
ai-milvus-proxy-746f9fcd96-wl7j7 1/1 Running 0 4d4h
ai-milvus-querynode-0-699d99445b-4cghf 1/1 Running 0 4d4h
ai-milvus-querynode-0-699d99445b-j9fvn 1/1 Running 0 4d4h
As you can see, the old mixcoord pod is still in terminating state, and only 2 replicas of the new mixcoord pod exist. Considering there are no issues with the service, I thought that one of the 2 new mixcoord pod replicas must be active or there might be a problem with the log recording.
However, the logs from the 2 new mixcoord pod replicas were not different from those of mixcoord-1 and mixcoord-2 when there was no traffic. I am thinking this way because even the new mixcoord pod that appears to be active is often recording standby logs.
[2024/03/25 07:10:30.988 +00:00] [INFO] [observers/target_observer.go:473] ["observer trigger update current target"] [collectionID=447872829707677642]
2024-03-25 16:10:30.988
[2024/03/25 07:10:30.988 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=7353d61229ee9c0e5f3312d27227a34b] [collectionID=447872829696987074]
2024-03-25 16:10:30.987
[2024/03/25 07:10:30.987 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=8b345101e7fc8216db8865e2ef183014] [collectionID=447872829707677642]
2024-03-25 16:10:30.986
[2024/03/25 07:10:30.986 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=bf42ba3792f32f5e478abdc3a2acc1ea] [collectionID=447872829696987074]
2024-03-25 16:10:30.986
[2024/03/25 07:10:30.986 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=c6087e83b2af9794df7ade62f8283a09] [collectionID=447872829707677642]
2024-03-25 16:10:30.985
[2024/03/25 07:10:30.985 +00:00] [INFO] [observers/target_observer.go:450] ["Update readable segment version"] [collectionID=447872829707677642] [channelName=ai-rootcoord-dml_2_447872829707677642v0] [nodeID=72] [oldVersion=1711350610988876132] [newVersion=1711350620988345392]
2024-03-25 16:10:30.628
[2024/03/25 07:10:30.628 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: querycoord is in STANDBY ..."]
2024-03-25 16:10:30.208
[2024/03/25 07:10:30.208 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: datacoord is in STANDBY ..."]
2024-03-25 16:10:29.875
[2024/03/25 07:10:29.875 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: rootcoord is in STANDBY ..."]
2024-03-25 16:10:25.265
[2024/03/25 07:10:25.265 +00:00] [DEBUG] [config/refresher.go:70] ["etcd refreshConfigurations"] [prefix=ai/config] [endpoints="[ai-etcd.ai-milvus:2379]"]
2024-03-25 16:10:20.989
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/services.go:799] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=20df5b89f35d7502f6946ead08031b94] [collectionID=447872829696987074] [partitionIDs="[]"] [channel=ai-rootcoord-dml_1_447872829696987074v0] ["# of unflushed segments"=0] ["# of flushed segments"=5] ["# of dropped segments"=0] ["# of indexed segments"=0]
2024-03-25 16:10:20.989
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/handler.go:299] ["channel seek position set from channel checkpoint meta"] [channel=ai-rootcoord-dml_1_447872829696987074v0] [posTs=448620282719502338] [posTime=2024/03/25 07:09:25.794 +00:00]
2024-03-25 16:10:20.989
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/handler.go:116] [GetQueryVChanPositions] [collectionID=447872829696987074] [channel=ai-rootcoord-dml_1_447872829696987074v0] [numOfSegments=5] ["indexed segment"=5]
2024-03-25 16:10:20.989
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/services.go:784] ["get recovery info request received"] [traceID=20df5b89f35d7502f6946ead08031b94] [collectionID=447872829696987074] [partitionIDs="[]"]
Even now, both pods in another cluster are recording standby logs, but there is absolutely no impact on the service. (The reason for no terminating pods in the capture is because I performed a force delete.)
ai-milvus-mixcoord-76599769-1.txt ai-milvus-mixcoord-76599769-2.txt
I have been questioning why both mixcoord pod logs output standby for some time.
Please give the logs of the above pods ai-milvus-mixcoord-5b7b68865-547sk and ai-milvus-mixcoord-5b7b68865-lgcs8, which should hold the active session lock.
@weiliu1031
The logs for that pod are mixcoord-1 and mixcoord-2.
Please upload all 4 mixcoords' logs.
@weiliu1031
This is the log of the currently running mixcoord node at the time of the upgrade.
For rootcoord, there are 12 alive sessions, but the active one is still the oldest pod with serverId=10 and address=10.4.31.99:53100. All the rootcoords behind it are stuck in STANDBY mode because the oldest one still holds the active session. Find the oldest mixcoord by its address and upload its log.
Whatever you have done, as long as the oldest mixcoord doesn't exit, the newly started mixcoords will stay in STANDBY forever.
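If it helps, here is a rough way to inspect those sessions directly in etcd. This is only a sketch: the key prefix depends on your etcd rootPath / metaSubPath settings, so "ai/meta/session/" below is an assumption based on the "ai/config" prefix visible in your logs, and the endpoint is taken from the same logs.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"ai-etcd.ai-milvus:2379"}, // endpoint taken from the logs above
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// List all coordinator sessions; the values are JSON and should include the
	// server ID and address, so you can spot which pod (e.g. 10.4.31.99:53100)
	// still holds the active session.
	resp, err := cli.Get(context.Background(), "ai/meta/session/", clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s (lease %d)\n", kv.Key, kv.Value, kv.Lease)
	}
}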
@weiliu1031
cause the oldest one still hold the active session
Why is that? The oldest mixcoord no longer exists, so why does it still hold the active session? Is it a bug?
If the oldest mixcoord really doesn't exist, it would be a bug. But the keepalive interval for the session's lease is about 30s, which means the oldest mixcoord's session should be automatically released after 30s. Since it still exists after a few days, I guess the oldest mixcoord hasn't actually exited.
@weiliu1031
It's true that there was a pod with that IP. However, as shown in the attached image, it's a pod that was terminated on March 19th. This means that the pod didn't exist at the time of the upgrade. How did the rootcoord know to connect, and how could it possibly leave a log indicating that the connection was successful?
Milvus implements sessions with etcd leases, and it has a keepalive policy of roughly every 30s. The latest rootcoord gets all sessions from etcd and found that the oldest coord still holds the session, which means the oldest mixcoord hasn't exited.
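A minimal sketch of that lease-based session mechanism (not Milvus's actual code, just the idea, assuming etcd clientv3): the session key is attached to a short-TTL lease that is kept alive while the process runs. If the process keeps running unnoticed after a force delete, the lease never expires and newer coords stay in STANDBY.

package example

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerSession puts a session key bound to a ~30s lease and keeps the lease
// alive for as long as the process runs. Kill the process and the key expires
// after the TTL; leave the container silently running and the key (and the
// active session) stays forever.
func registerSession(ctx context.Context, cli *clientv3.Client, key, value string) error {
	lease, err := cli.Grant(ctx, 30) // ~30s TTL, roughly matching the interval mentioned above
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch { // drain keepalive responses while the process is alive
		}
		log.Printf("keepalive for %s stopped; the key expires after the TTL", key)
	}()
	return nil
}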
@ian-lss did you use kubectl delete --force to delete that pod? In that case it's possible that the container is still running on the worker node, holding the active session without being detected.
Could you log in to that node and check whether the mixcoord container is still running?
@weiliu1031 @haorenfsa
If my memory serves correctly, this is a pod that was force deleted for the upgrade to version 2.3.11, before the upgrade to 2.3.12. Could this be the cause? If so, does this mean that we should not force delete terminating mixcoord pods? The reason for the force delete was that, similarly, the mixcoord pod did not terminate during a previous upgrade and remained in a terminating state.
As mentioned above, the pod has not physically existed since 3/19. The upgrade was on 3/21.
Yes, the force delete is why the oldest mixcoord still exists now. You may need to log in to the node to clean up the unexpected pod, which is actually acting as the active mixcoord now.
@weiliu1031
What does "you may need to log in to the node to clean up the unexpected pod" mean?
The initial reason we had to force delete was due to a GKE issue that resulted in one of the mixcoords being deleted and recreated (with its IP changed). Since then, terminating pods have started to appear with every upgrade. This situation seems like it could occur frequently during operations. Does this mean that from this point on, there will inevitably be terminating pods with every upgrade, implying that a regular cleanup will always be necessary?
The answer is no. The kubectl delete --force mixcoord operation caused an unexpected mixcoord, which has held the session until now; just remove it and future rolling upgrades will go back to normal.
@weiliu1031
I don't understand what you mean by "remove the session." What session should be removed, and how?
The oldest mixcoord with address=10.4.31.99:53100 is still alive on the node; you need to shut it down. You executed kubectl delete --force on this container, but the container is still running on the worker node, holding the active session without being detected.
@weiliu1031 I keep informing you that the pod with IP 10.4.31.99 was physically deleted two days before the upgrade, and it no longer exists. That's why I was wondering what needs to be removed.
When you execute kubectl delete --force, the pod is removed from k8s' apiserver, but the container is still alive and won't be detected by k8s.
If my memory serves correctly, this is a pod that was force deleted for the upgrade to version 2.3.11, before the upgrade to 2.3.12. Could this be the cause?
Yes, it's the cause
If so, does this mean that mixcoord should not force delete terminating pods?
Yes, force delete only stops k8s from tracking the pod's lifecycle. It won't actually guarantee the deletion of the pod on the machine.
The reason for the force delete was because, similarly, the mixcoord pod did not terminate due to a previous upgrade and remained in a terminating state.
In this case, if you want it to end in a short time you could either:
kubectl delete $podName --grace-period=1
or
kubectl exec -it $podName -- bash -c 'pid=$(ps -C milvus -o pid=); kill -9 $pid'
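For completeness, the same can be done from code with client-go. This is only a sketch: the pod name below is a placeholder, the namespace is taken from the operator logs above, and it assumes in-cluster credentials.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // or use clientcmd for out-of-cluster access
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Equivalent to kubectl delete --grace-period=1: SIGTERM, then SIGKILL after 1 second.
	grace := int64(1)
	err = cs.CoreV1().Pods("milvus").Delete(context.Background(), "vs-milvus-mixcoord-xxxxx",
		metav1.DeleteOptions{GracePeriodSeconds: &grace})
	if err != nil {
		panic(err)
	}
}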
I keep informing you that the pod with IP 10.4.31.99 was physically deleted two days before the upgrade, and it no longer exists. That's why I was wondering what needs to be removed.
As described above, 10.4.31.99 has only logically disappeared from k8s' control plane, but it may still be running on the worker node machine.
You can log in to the machine, if possible, to verify this theory. If it's a cloud provider's VM or something similar, try taking it offline or rebooting it.
@haorenfsa @weiliu1031
Thank you. I have completely understood the phenomenon you explained. I also understand now why the "Display Error" issue close occurred. And I remember that we performed a force delete during the upgrade process, though not from the beginning; I'm not entirely certain of my memory. Therefore, I will test a few pod termination scenarios in GKE and let you know.
@haorenfsa @weiliu1031
When I attempt to delete and recreate one of the mixcoord pods by running the kubectl delete command, a new pod is created, but the previous pod remains in the Terminating state without exiting because it is connected to the active mixcoord. In the current situation, since the active mixcoord is in service, we cannot resolve the terminating situation by deleting or restarting the active mixcoord, as we would with a force delete. How should a normal mixcoord pod deletion be handled?
It seems like we began to force delete from this point.
When attempting to delete and recreate one of the mixcoord pods by running the kubectl delete command, a new pod is created, and the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord.
@weiliu1031 This behavior is not as expected; we expect the terminating mixcoord pod to exit on its own.
What do you mean by "the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord"? Why would the old mixcoord be connected to the active mixcoord? Please upload the mixcoord logs.
@weiliu1031 @haorenfsa
I created a completely new cluster and set up two replicas of mixcoord. Afterwards, I executed a delete command on the pod whose log is mixcoord-terminating_old_pod2.txt, and mixcoord-newpod2.txt was newly created. Nevertheless, mixcoord-terminating_old_pod2 remains in a terminating state. Among the contents of mixcoord-terminating_old_pod2.txt there is ["RootCoordClient GetSessions success"] [address=10.3.22.115:53100]; this IP belongs to mixcoord-pod1. Considering what you said about force deleting before, I thought that session might be the reason why mixcoord-terminating_old_pod2 is still in a terminating state. So my question was, "How do you delete a pod?"
mixcoord-pod1.txt mixcoord-terminating_old_pod2.txt mixcoord-newpod2.txt
@ian-lss
So, my question was, "How do you delete a pod?"
As I mentioned above:
In this case, if you want it to end in a short time you could either:
kubectl delete $podName --grace-period=1
or
kubectl exec -it $podName -- bash -c 'pid=$(ps -C milvus -o pid=); kill -9 $pid'
@haorenfsa
Right. I haven't tried it yet, but I assume it would work with kubectl delete $podName --grace-period=1 or by killing the process. However, the operator always sets the TerminationGracePeriodSeconds value to one month, and I'm wondering why it was set to such a lengthy duration. From the comments alone, the choice of a one-month period doesn't make sense to me.
// oneMonthSeconds we set both podtemplate.spec.terminationGracePeriodSeconds &
// deployment.spec.progressDeadlineSeconds to one month, to avoid kill -9 on pod automatically.
// so that when pod stuck on rolling, the service will still be available.
// We'll have enough time to find the root cause and handle it gracefully.
Personally, even one week seems too long. Can't we make it possible for users to specify this duration in the config?
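Something like the following is what I have in mind; this is purely a hypothetical sketch of a possible API field, not the operator's actual spec:

package v1beta1

// ComponentSpec is a trimmed-down stand-in for the operator's component spec;
// the field below is hypothetical.
type ComponentSpec struct {
	// RollingTerminationGracePeriodSeconds, if set, would override the default
	// one-month grace period applied during rolling upgrades.
	RollingTerminationGracePeriodSeconds *int64 `json:"rollingTerminationGracePeriodSeconds,omitempty"`
}

// effectiveGracePeriod returns the user-provided value if set, otherwise the default.
func effectiveGracePeriod(spec ComponentSpec, oneMonthSeconds int64) int64 {
	if spec.RollingTerminationGracePeriodSeconds != nil {
		return *spec.RollingTerminationGracePeriodSeconds
	}
	return oneMonthSeconds
}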
but I'm wondering why it was set to such a lengthy duration. From the comments alone, the choice of a one-month period doesn't make sense to me.
The mixcoord is expected to exit by itself within a couple of minutes, so in most cases the terminationGracePeriodSeconds really doesn't matter.
The other case is that it fails to exit by itself, as you encountered. In that case we're sure something unexpected happened, so a human has to get involved to handle it. If we set it too short, the pod will be killed before anyone gets a chance to react, and the service will be severely affected when a pod gets killed.
So we never want the kill to happen. But the value can effectively only be decreased at deletion time, not increased, so we set it to almost infinite to make sure one has enough time to react, and more degrees of freedom.
If you find a pod stuck in Terminating, and you're sure you can accept the downtime of killing it, you can always kill it at whatever time you think is appropriate.
Personally, even one week seems too long. Can't we make it possible for users to specify this duration in the config?
If you make a patch to add a config field, we will merge it. But I don't think it's worth the time: if the mixcoord gets stuck, it's a bug, and we should fix it in Milvus.
Hello, @weiliu1031 I’m wondering about the status of this issue. I upgraded from 2.3.12 to 2.4.4 using version 0.9.15, but the same issue is still occurring. There are no issues when applying spec changes or config changes other than the upgrade. cc : @haorenfsa
Hi @ian-lss, as far as I know, it's been completely solved in 2.3.16+ or 2.4.3+. @weiliu1031 any comments?
@ian-lss Since you upgraded from 2.3.12, it's expected that you may still run into this case. It should be fine afterwards.
By the way, it now makes sense to me to make terminationGracePeriodSeconds configurable, since I don't have full confidence that this issue will never occur again in Milvus. It will be available in the next version.
@haorenfsa Thank you for your response. I will proceed with the upgrade in the next version and let you know the results.
@haorenfsa I upgraded to version 2.4.5 this time. As you mentioned, the mixcoord pod issue has been neatly resolved. Thank you for your help during this time.
Hi, I'm encountering an issue where the version upgrade via the operator is not working. I'm upgrading from 2.3.8 to 2.3.10, and there are two issues:
The YAML was written by copy and paste, so it's completely identical. Do you have any idea what the problem might be?