zilliztech / milvus-operator

The Kubernetes Operator of Milvus.
https://milvus.io
Apache License 2.0

For the issue where the upgrade does not complete: #88

Open ian-lss opened 7 months ago

ian-lss commented 7 months ago

Hi, I'm encountering an issue where the version upgrade via the operator is not working. I'm upgrading from 2.3.8 to 2.3.10, and there are two issues:

  1. Although the upgrade is complete, the version information displayed in Attu still shows 2.3.8, while each component checked with Argo CD is at 2.3.10. In Attu's system info view, datacoord is still shown as version 2.3.8.
  2. I proceeded with the upgrade on four clusters, three of which were successful (two of them have the same issue as described in point 1), but one cluster only had the index node successfully upgraded and is not progressing at all. There is also no error/debug information in the milvus-operator log.

The YAML is written by copy and paste, so it is completely identical across clusters. Do you have any idea what the problem might be?

haorenfsa commented 7 months ago

@ian-lss could you provide the milvus-operator's logs? Info- and warn-level logs can also provide a lot of information.

haorenfsa commented 6 months ago

https://github.com/zilliztech/attu/issues/437

ian-lss commented 6 months ago

hi, @haorenfsa

Sorry, but there are two issues here. One is the issue at https://github.com/zilliztech/attu/issues/437, and the other is an issue where the upgrade only applies up to the index node. For the part where the upgrade doesn't happen, there are some things I don't understand, so I'm adding more detailed logging to the code to pin the problem down. I'll share the logs after testing, so please wait a little longer.

haorenfsa commented 6 months ago

and the other is an issue where the upgrade only applies up to the index node.

I believe the upgrade process is quite well tested. I'll wait for your logs to see what's going on there.

ian-lss commented 6 months ago

hi, @haorenfsa

I added log prints to the milvus-operator source and tested it. In conclusion, I confirmed that the operator is working normally. I'm very sorry for the confusion: the reason I thought it was not functioning, as reported in this issue, was a hidden terminating mixcoord pod.

I do have a question, though. TerminationGracePeriodSeconds is applied to each component in the updateSomeFieldsOnlyWhenRolling function via template.Spec.TerminationGracePeriodSeconds = int64Ptr(int64(oneMonthSeconds)). After the upgrade, the pods of every component except mixcoord terminate successfully, but the mixcoord pod does not terminate and remains in the terminating state for an entire month. Is this situation intended? Given how frequently Milvus is upgraded at the moment (a new version is released within a month), we have no choice but to manually force delete the terminating mixcoord pod and then upgrade. Is this procedure of manually force deleting the terminating mixcoord pod before upgrading intended?
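
For reference, the behavior I'm describing corresponds roughly to the sketch below (a minimal, self-contained illustration built only from the names quoted above; the real operator code works on the Kubernetes PodTemplateSpec type, and the exact constant may differ):

package main

import "fmt"

// oneMonthSeconds: assumed to be roughly 30 days; the exact constant in the operator may differ.
const oneMonthSeconds = 30 * 24 * 60 * 60

func int64Ptr(i int64) *int64 { return &i }

// podTemplateSpec is a tiny stand-in for corev1.PodTemplateSpec, only to show the assignment.
type podTemplateSpec struct {
	TerminationGracePeriodSeconds *int64
}

// updateSomeFieldsOnlyWhenRolling (sketch): during a rolling upgrade, every component's pod
// template gets a very long grace period, so Kubernetes never sends SIGKILL on its own while
// an old coordinator is still expected to hand over its active session and exit by itself.
func updateSomeFieldsOnlyWhenRolling(spec *podTemplateSpec) {
	spec.TerminationGracePeriodSeconds = int64Ptr(int64(oneMonthSeconds))
}

func main() {
	var spec podTemplateSpec
	updateSomeFieldsOnlyWhenRolling(&spec)
	fmt.Println("terminationGracePeriodSeconds =", *spec.TerminationGracePeriodSeconds) // 2592000
}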

Below is a part of the log from the mixcoord in terminating state.

[2024/03/20 02:55:28.039 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: rootcoord is in STANDBY ..."]
[2024/03/20 02:55:28.836 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: querycoord is in STANDBY ..."]
[2024/03/20 02:55:28.836 +00:00] [DEBUG] [sessionutil/session_util.go:1043] ["serverName: datacoord is in STANDBY ..."]
[2024/03/20 02:55:32.347 +00:00] [DEBUG] [config/refresher.go:70] ["etcd refreshConfigurations"] [prefix=ai/config] [endpoints="[ai-etcd.ai-milvus:2379]"]
haorenfsa commented 6 months ago

but the mixcoord component does not terminate and remains in the terminating state for an entire month. Is this situation intended?

No, we expect the old mixcoord to change to standby, and then exit by itself. If a mixcoord hangs in its terminating state, it's likely to be a milvus issue.

Could you also provide us the logs from before it started terminating? And does this only happen when upgrading from 2.3.8 to a newer release?

@weiliu1031 any known issue in 2.3.8 that could make mixcoord hang in terminating?

weiliu1031 commented 6 months ago

but the mixcoord component does not terminate and remains in the terminating state for an entire month. Is this situation intended?

No, we expect the old mixcoord to change to standby, and then exit by itself. If a mixcoord hangs in its terminating state, it's likely to be a milvus issue.

Could you also provide us the logs from before it started terminating? And does this only happen when upgrading from 2.3.8 to a newer release?

@weiliu1031 any known issue in 2.3.8 that could make mixcoord hang in terminating?

#31213 and #31447 both have an issue that may hang rootcoord for about 1 minute. If the rolling upgrade progress gets stuck, please upload the whole cluster's logs from during the rolling upgrade for investigation.

ian-lss commented 6 months ago

@haorenfsa Which logs do you mean by the logs before shutdown? I had to force terminate mixcoord because of the version upgrade. The problematic versions are all of the versions we have run with milvus-operator: 2.3.8, 2.3.10, 2.3.11, and 2.3.12.

@weiliu1031 Which logs are needed if the rolling upgrade process gets stuck? Below is a summary of some logs from the time of the mixcoord hang-up.

To briefly explain the analysis,

In the failure case: ReconcileComponentDeployment (deployments.go) -> updateDeployment (deployments.go) -> CheckComponentHasTerminatingPod (component_condition.go) -> HasTerminatingPodByListOpts (component_condition.go) returns true -> updateDeploymentWithoutPodTemplate (deployment_updater.go) -> IsEqual(old, cur) returns true (deployments.go) -> end.

In the success case: ReconcileComponentDeployment (deployments.go) -> updateDeployment (deployments.go) -> CheckComponentHasTerminatingPod (component_condition.go) -> HasTerminatingPodByListOpts (component_condition.go) returns false -> updateDeployment (deployment_updater.go) -> updateDeploymentWithoutPodTemplate (deployment_updater.go) -> updatePodTemplate (deployment_updater.go) -> updateMilvusContainer (deployment_updater.go) -> RollingUpdateImageDependencyReady (deployment_updater.go) -> GetDependencies (deployment_updater.go) -> IsImageUpdated (components.go) -> GetState (milvus_types.go) returns true -> container.Image = mergedComSpec.Image -> IsEqual(currentTemplate, template) -> updateSomeFieldsOnlyWhenRolling -> IsEqual(old, cur) returns false (deployments.go) -> Update (k8s SDK) -> end.
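
In other words, the decisive branch is the terminating-pod check. Conceptually it behaves like the sketch below (hypothetical, heavily simplified stand-in types and names, not the actual operator code):

package main

import "fmt"

// component is a hypothetical stand-in for the operator's component descriptor.
type component struct {
	name              string
	hasTerminatingPod bool
}

// reconcile mirrors the branch described above: while any old pod of the component is still
// terminating, the pod template (and therefore the container image) is left untouched, which
// is why the operator logs "component has terminating pod, skip update template" and then
// "nothing diff context".
func reconcile(c component) string {
	if c.hasTerminatingPod {
		return "skip pod template update (only non-template fields reconciled)"
	}
	return "update pod template with the new image and push the Deployment"
}

func main() {
	fmt.Println("mixcoord:", reconcile(component{name: "mixcoord", hasTerminatingPod: true}))
	fmt.Println("proxy:", reconcile(component{name: "proxy", hasTerminatingPod: false}))
}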

2024-03-19T03:08:10.607Z    INFO    controller.status-syncer    syncUnealthyOrUpdating end  {"duration": "61.54914ms"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "mixcoord", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   Start Update Deployment {"name": "vs-milvus-mixcoord"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   [MilvusReconciler.updateDeployment] component has terminating pod, skip update template {"component": "mixcoord"}
2024-03-19T03:08:10.611Z    INFO    controller.milvus   nothing diff context    {"name": "vs-milvus-mixcoord", "namespace": "milvus"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "standalone", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   Start Update Deployment {"name": "vs-milvus-standalone"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "standalone"}
2024-03-19T03:08:10.610Z    INFO    GetDependencies {"depGraph": null, "component": {"Name":"standalone","FieldName":"Standalone","DefaultPort":19530}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep count": 0, "component": {"Name":"standalone","FieldName":"Standalone","DefaultPort":19530}}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "standalone"}
2024-03-19T03:08:10.610Z    INFO    updatePodTemplate   {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "standalone"}
2024-03-19T03:08:10.611Z    INFO    controller.milvus   nothing diff context    {"name": "vs-milvus-standalone", "namespace": "milvus"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "proxy", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   Start Update Deployment {"name": "vs-milvus-proxy"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "proxy"}
2024-03-19T03:08:10.610Z    INFO    GetDependencies {"depGraph": [{"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}], "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep count": 1, "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.611Z    INFO    RollingUpdateImageDependencyReady   {"dep": "datanode:DataNode", "component": {"Name":"proxy","FieldName":"Proxy","DefaultPort":19530}}
2024-03-19T03:08:10.611Z    INFO    IsImageUpdated  {"ComponentsDeployStatus": 6, "name": "datanode", "component": "proxy"}
2024-03-19T03:08:10.611Z    INFO    IsImageUpdated  {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "datanode", "component": "proxy"}
2024-03-19T03:08:10.611Z    INFO    IsImageUpdated  {"deployStatus.GetState()": "Complete", "name": "datanode", "component": "vs-milvus.milvus:19530", "component": "proxy"}
2024-03-19T03:08:10.611Z    INFO    updateMilvusContainer   {"container.Image": "milvusdb/milvus:v2.3.10", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "proxy"}
2024-03-19T03:08:10.611Z    INFO    updatePodTemplate   {"IsEqual template": false, "image": "milvusdb/milvus:v2.3.11", "name": "proxy"}
2024-03-19T03:08:10.611Z    INFO    updatePodTemplate   {"updateSomeFieldsOnlyWhenRolling": "Exec", "name": "proxy"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "datanode", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   Start Update Deployment {"name": "vs-milvus-datanode"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "datanode"}
2024-03-19T03:08:10.610Z    INFO    GetDependencies {"depGraph": [{"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}], "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep count": 1, "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep": "querynode:QueryNode", "component": {"Name":"datanode","FieldName":"DataNode","DefaultPort":21124}}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"ComponentsDeployStatus": 6, "name": "querynode", "component": "datanode"}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "querynode", "component": "datanode"}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"deployStatus.GetState()": "Complete", "name": "querynode", "component": "vs-milvus.milvus:19530", "component": "datanode"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "datanode"}
2024-03-19T03:08:10.610Z    INFO    updatePodTemplate   {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "datanode"}
2024-03-19T03:08:10.611Z    INFO    controller.milvus   nothing diff context    {"name": "vs-milvus-datanode", "namespace": "milvus"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "indexnode", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    controller.milvus   Start Update Deployment {"name": "vs-milvus-indexnode"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "indexnode"}
2024-03-19T03:08:10.610Z    INFO    GetDependencies {"depGraph": [], "component": {"Name":"indexnode","FieldName":"IndexNode","DefaultPort":21121}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep count": 0, "component": {"Name":"indexnode","FieldName":"IndexNode","DefaultPort":21121}}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "indexnode"}
2024-03-19T03:08:10.611Z    INFO    updatePodTemplate   {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "indexnode"}
2024-03-19T03:08:10.611Z    INFO    controller.milvus   nothing diff context    {"name": "vs-milvus-indexnode", "namespace": "milvus"}

2024-03-19T03:08:10.610Z    INFO    controller.milvus   Reconcile Component {"name": "vs", "component": "querynode", "mode": "cluster"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"isCreating": false, "updater.GetMilvus().IsRollingUpdateEnabled()": true, "updater.GetMilvus().Spec.Com.ImageUpdateMode": "rollingUpgrade", "name": "querynode"}
2024-03-19T03:08:10.610Z    INFO    GetDependencies {"depGraph": [{"Name":"mixcoord","FieldName":"MixCoord","DefaultPort":-1}], "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep count": 1, "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z    INFO    RollingUpdateImageDependencyReady   {"dep": "mixcoord:MixCoord", "component": {"Name":"querynode","FieldName":"QueryNode","DefaultPort":21123}}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"ComponentsDeployStatus": 6, "name": "mixcoord", "component": "querynode"}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"m.Spec.Com.Image": "milvusdb/milvus:v2.3.11", "deployStatus.Image": "milvusdb/milvus:v2.3.11", "name": "mixcoord", "component": "querynode"}
2024-03-19T03:08:10.610Z    INFO    IsImageUpdated  {"deployStatus.GetState()": "Complete", "name": "mixcoord", "component": "vs-milvus.milvus:19530", "component": "querynode"}
2024-03-19T03:08:10.610Z    INFO    updateMilvusContainer   {"container.Image": "milvusdb/milvus:v2.3.11", "mergedComSpec.Image": "milvusdb/milvus:v2.3.11", "name": "querynode"}
2024-03-19T03:08:10.610Z    INFO    updatePodTemplate   {"IsEqual template": true, "image": "milvusdb/milvus:v2.3.11", "name": "querynode"}

2024-03-19T03:08:10.611Z    INFO    controller.helm getting history for release vs-kafka
2024-03-19T03:08:10.612Z    INFO    controller.helm getting history for release vs-etcd
2024-03-19T03:08:10.612Z    INFO    controller.milvus   Update Deployment   {
"name": "vs-milvus-proxy", 
"namespace": "milvus", 
"diff": "--- From\n+++ To
@@ -433,9 +433,9 @@
       Name: (string) (len=6) \"config\",
       -      Image: (string) (len=31) \"milvusdb/milvus-operator:v0.9.5\",
       -      Command: ([]string) (len=3) {
       -       (string) (len=3) \"/cp\",
       -       (string) (len=14) \"/run.sh,/merge\",
       -       (string) (len=40) \"/milvus/tools/run.sh,/milvus/tools/merge\"
       -      },\n-      Args: ([]string) <nil>,
       +      Image: (string) (len=107) \"u-docker.pkg.dev/mp-artifact-registry-aa49/dev/milvus-operator:2024-03-18_14-55_87fcc43\",
       +      Command: ([]string) (len=1) {
       +       (string) (len=7) \"/bin/sh\"
       +      },
       +      Args: ([]string) (len=1) {
       +       (string) (len=8) \"/init.sh\"
       +      },
              WorkingDir: (string) \"\",
              @@ -448,3 +448,3 @@
                     },
                     -      VolumeMounts: ([]v1.VolumeMount) (len=1) {
                     +      VolumeMounts: ([]v1.VolumeMount) (len=2) {
                             (v1.VolumeMount) {
                             @@ -454,2 +454,10 @@
                                      SubPath: (string) \"\",
                                      +        MountPropagation: (*v1.MountPropagationMode)(<nil>),
                                      +        SubPathExpr: (string) \"\"
                                      +       },
                                      +       (v1.VolumeMount) {
                                      +        Name: (string) (len=13) \"milvus-config\",
                                      +        ReadOnly: (bool) true,
                                      +        MountPath: (string) (len=25) \"/milvus/configs/user.yaml\",
                                      +        SubPath: (string) (len=9) \"user.yaml\",
                                               MountPropagation: (*v1.MountPropagationMode)(<nil>),
                                               @@ -475,3 +483,3 @@
                                                      Name: (string) (len=5) \"proxy\",
                                                      -      Image: (string) (len=23) \"milvusdb/milvus:v2.3.10\",
                                                      +      Image: (string) (len=23) \"milvusdb/milvus:v2.3.11\",
                                                             Command: ([]string) <nil>,
                                                             "}
haorenfsa commented 6 months ago

I mean the logs of the Milvus pods, not the milvus-operator's. The operator works just as expected; it's likely a bug in Milvus, so we need its logs to dig in.

ian-lss commented 6 months ago

milvus-log.zip

Hi, @haorenfsa @weiliu1031 To obtain accurate logs for the specific range, I tried upgrading another cluster (to 2.3.12). I've attached the logs of the old version pods of each component during the rolling upgrade. Is this what you wanted? Since we configured two replicas for each component, there are two pods per component.

Additionally, I have two more questions. First, there was no problem with the upgrade from 2.3.10 to 2.3.11 (except for the mixcoord hangup), but after the upgrade to 2.3.12,

[2024/03/21 06:04:25.302 +00:00] [WARN] [broker/datacoord.go:102] ["failed to update channel checkpoint"] [channelNames="[ai-rootcoord-dml_3_447872829707677642v1,ai-rootcoord-dml_1_447872829696987074v0]"] [channelCheckpointTimes="[2024/03/21 06:04:22.246 +00:002024/03/21 06:03:48.845 +00:00]"] [error="channelCP is nil, vChannel="]
[2024/03/21 06:04:25.302 +00:00] [WARN] [datanode/channel_checkpoint_updater.go:129] ["update channel checkpoint failed"] [error="channelCP is nil, vChannel="]

this error occurred. Initially I thought it was a problem specific to one particular cluster, but seeing it again after this upgrade makes me wonder if it's a problem with 2.3.12. It seems related to a Milvus issue, similar to https://github.com/milvus-io/milvus/issues/30004. The log shows it as a warning, but could it be a critical issue for the service?

Another thing, as you can see from the attached querynode log, there's a significant difference in terminating times. One pod terminates quickly without any issues, but the other takes about 30 minutes to terminate. Why is that? Considering the physical memory usage is the same for both pods, it doesn't seem like a balancing issue, does it?

weiliu1031 commented 6 months ago

milvus-log.zip

Hi, @haorenfsa @weiliu1031 To obtain accurate logs for the specific range, I tried upgrading another cluster (to 2.3.12). I've attached the logs of the old version pods of each component during the rolling upgrade. Is this what you wanted? Since we configured two replicas for each component, there are two pods per component.

Something is wrong with the logs above: I looked at mixcoord-1.txt and mixcoord-2.txt, and they show that both mixcoord-1 and mixcoord-2 are stuck in the STANDBY state. That means the active mixcoord is neither mixcoord-1 nor mixcoord-2, or I got duplicate logs of a standby mixcoord.

If something is stuck during the rolling upgrade, it means the old active coord/node did not exit as expected. I need the logs of that old active coord/node; then I'll figure out what is happening here.

weiliu1031 commented 6 months ago

milvus-log.zip

Additionally, I have two more questions. First, there was no problem with the upgrade from 2.3.10 to 2.3.11 (except for the mixcoord hang-up), but after the upgrade to 2.3.12 the "update channel checkpoint failed" / "channelCP is nil" warnings occurred. The log shows them as warnings, but could they be a critical issue for the service?

Another thing, as you can see from the attached querynode log, there's a significant difference in terminating times. One pod terminates quickly without any issues, but the other takes about 30 minutes to terminate. Why is that?

About the two problems above:

  1. The warn log in datanode is a known issue; there will be a related PR, but it has no negative impact on the cluster.
  2. As for the stuck querynode, we also need the active mixcoord's logs to figure out what is happening.
ian-lss commented 6 months ago

@weiliu1031

The mixcoord-x.log file I provided is definitely the old active coord/node. That's what we're wondering about as well. According to the log, it only executes ["Get signal to exit"] [signal=terminated] at [roles/roles.go:285] and there are no further actions. The question is why it only executes the signal and doesn't proceed any further.

The speculation is that the node appears in terminating state because it has received an exit signal.

ian-lss commented 6 months ago

@weiliu1031

Does "additional active mixcoord log" for the querynode issue refer to the new active mixcoord log?

weiliu1031 commented 6 months ago

@weiliu1031

The mixcoord-x.log file I provided is definitely the old active coord/node. That's what we're wondering about as well. According to the log, it only executes ["Get signal to exit"] [signal=terminated] at [roles/roles.go:285] and there are no further actions. The question is why it only executes the signal and doesn't proceed any further.

The speculation is that the node appears in terminating state because it has received an exit signal.

Both mixcoord-1 and mixcoord-2 are in STANDBY status, but the querynodes are executing tasks such as loadSegment, which means there must be an active mixcoord besides the standby mixcoord-1 and standby mixcoord-2.

ian-lss commented 6 months ago

@weiliu1031

We are operating with only 2 replicas, so there is no additional mixcoord pod.

I'm attaching additional captures of the pod screen from ArgoCD and the pod list from kubectl. screenshot 2024-03-25 16 36 37

kubectl get pod -n ai-milvus
NAME                                     READY   STATUS        RESTARTS        AGE
ai-etcd-0                                1/1     Running       0               33d
ai-etcd-1                                1/1     Running       0               33d
ai-etcd-2                                1/1     Running       0               33d
ai-kafka-0                               1/1     Running       2 (33d ago)     33d
ai-kafka-1                               1/1     Running       2 (33d ago)     33d
ai-kafka-2                               1/1     Running       2 (33d ago)     33d
ai-kafka-zookeeper-0                     1/1     Running       0               33d
ai-kafka-zookeeper-1                     1/1     Running       0               33d
ai-milvus-attu-8f49bb85f-c57qr           1/1     Running       3 (4h58m ago)   18d
ai-milvus-datanode-748759898c-b7gh8      1/1     Running       0               4d4h
ai-milvus-datanode-748759898c-ckzh9      1/1     Running       0               4d4h
ai-milvus-indexnode-856b87865-ldh4r      1/1     Running       0               4d4h
ai-milvus-indexnode-856b87865-zgqdg      1/1     Running       0               4d4h
ai-milvus-mixcoord-5b7b68865-547sk       1/1     Terminating   0               5d22h
ai-milvus-mixcoord-5b7b68865-lgcs8       1/1     Terminating   0               5d22h
ai-milvus-mixcoord-6c7b99564d-p59ml      1/1     Running       0               4d4h
ai-milvus-mixcoord-6c7b99564d-wbwfv      1/1     Running       0               4d4h
ai-milvus-proxy-746f9fcd96-5w5k2         1/1     Running       0               4d4h
ai-milvus-proxy-746f9fcd96-wl7j7         1/1     Running       0               4d4h
ai-milvus-querynode-0-699d99445b-4cghf   1/1     Running       0               4d4h
ai-milvus-querynode-0-699d99445b-j9fvn   1/1     Running       0               4d4h

As you can see, the old mixcoord pod is still in terminating state, and only 2 replicas of the new mixcoord pod exist. Considering there are no issues with the service, I thought that one of the 2 new mixcoord pod replicas must be active or there might be a problem with the log recording.

However, the logs from the 2 new mixcoord pod replicas were not different from those of mixcoord-1 and mixcoord-2 when there was no traffic. I am thinking this way because even the new mixcoord pod that appears to be active is often recording standby logs.

[2024/03/25 07:10:30.988 +00:00] [INFO] [observers/target_observer.go:473] ["observer trigger update current target"] [collectionID=447872829707677642]
[2024/03/25 07:10:30.988 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=7353d61229ee9c0e5f3312d27227a34b] [collectionID=447872829696987074]
[2024/03/25 07:10:30.987 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=8b345101e7fc8216db8865e2ef183014] [collectionID=447872829707677642]
[2024/03/25 07:10:30.986 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=bf42ba3792f32f5e478abdc3a2acc1ea] [collectionID=447872829696987074]
[2024/03/25 07:10:30.986 +00:00] [INFO] [datacoord/index_service.go:821] ["List index success"] [traceID=c6087e83b2af9794df7ade62f8283a09] [collectionID=447872829707677642]
[2024/03/25 07:10:30.985 +00:00] [INFO] [observers/target_observer.go:450] ["Update readable segment version"] [collectionID=447872829707677642] [channelName=ai-rootcoord-dml_2_447872829707677642v0] [nodeID=72] [oldVersion=1711350610988876132] [newVersion=1711350620988345392]
[2024/03/25 07:10:30.628 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: querycoord is in STANDBY ..."]
[2024/03/25 07:10:30.208 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: datacoord is in STANDBY ..."]
[2024/03/25 07:10:29.875 +00:00] [DEBUG] [sessionutil/session_util.go:1059] ["serverName: rootcoord is in STANDBY ..."]
[2024/03/25 07:10:25.265 +00:00] [DEBUG] [config/refresher.go:70] ["etcd refreshConfigurations"] [prefix=ai/config] [endpoints="[ai-etcd.ai-milvus:2379]"]
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/services.go:799] ["datacoord append channelInfo in GetRecoveryInfo"] [traceID=20df5b89f35d7502f6946ead08031b94] [collectionID=447872829696987074] [partitionIDs="[]"] [channel=ai-rootcoord-dml_1_447872829696987074v0] ["# of unflushed segments"=0] ["# of flushed segments"=5] ["# of dropped segments"=0] ["# of indexed segments"=0]
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/handler.go:299] ["channel seek position set from channel checkpoint meta"] [channel=ai-rootcoord-dml_1_447872829696987074v0] [posTs=448620282719502338] [posTime=2024/03/25 07:09:25.794 +00:00]
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/handler.go:116] [GetQueryVChanPositions] [collectionID=447872829696987074] [channel=ai-rootcoord-dml_1_447872829696987074v0] [numOfSegments=5] ["indexed segment"=5]
[2024/03/25 07:10:20.989 +00:00] [INFO] [datacoord/services.go:784] ["get recovery info request received"] [traceID=20df5b89f35d7502f6946ead08031b94] [collectionID=447872829696987074] [partitionIDs="[]"]

Even now, both pods in another cluster are recording standby logs, but there is absolutely no impact on the service. (The reason there are no terminating pods in the capture is that I performed a force delete.)

screenshot 2024-03-25 16 59 05 ai-milvus-mixcoord-76599769-1.txt ai-milvus-mixcoord-76599769-2.txt

I have been questioning why both mixcoord pod logs output standby for some time.

weiliu1031 commented 6 months ago

@weiliu1031

We are operating with only 2 replicas, so there is no additional mixcoord pod.

As you can see, the old mixcoord pod is still in terminating state, and only 2 replicas of the new mixcoord pod exist. Considering there are no issues with the service, I thought that one of the 2 new mixcoord pod replicas must be active or there might be a problem with the log recording.

I have been questioning why both mixcoord pod logs output standby for some time.

Please provide the logs of the pods ai-milvus-mixcoord-5b7b68865-547sk and ai-milvus-mixcoord-5b7b68865-lgcs8 above, which should hold the active session lock.

ian-lss commented 6 months ago

@weiliu1031

The logs for that pod are mixcoord-1 and mixcoord-2.

weiliu1031 commented 6 months ago

@weiliu1031

The logs for that pod are mixcoord-1 and mixcoord-2.

Please upload all four mixcoords' logs.

ian-lss commented 6 months ago

@weiliu1031

This is the log of the currently running mixcoord node at the time of the upgrade.

mixcoord-6c7b99564d-1.txt mixcoord-6c7b99564d-2.txt

weiliu1031 commented 6 months ago

@weiliu1031

This is the log of the currently running mixcoord node at the time of the upgrade.

mixcoord-6c7b99564d-1.txt mixcoord-6c7b99564d-2.txt

image

For rootcoord there are 12 alive sessions, but the active one is still the oldest pod, with serverId=10 and address=10.4.31.99:53100. All the rootcoords behind it are stuck in STANDBY mode because the oldest one still holds the active session. Please find the oldest mixcoord by its address and upload its log.

Whatever you have done, the oldest mixcoord did not exit, so the newly started mixcoords will stay in STANDBY forever.

haorenfsa commented 6 months ago

@weiliu1031

because the oldest one still holds the active session

Why is that? The oldest mixcoord no longer exists, so why does it still hold the active session? Is it a bug?

weiliu1031 commented 6 months ago

@weiliu1031

because the oldest one still holds the active session

Why is that? The oldest mixcoord no longer exists, so why does it still hold the active session? Is it a bug?

If the oldest mixcoord really no longer exists, it would be a bug. But the keepalive interval for the session's lease is about 30s, which means the oldest mixcoord's session should be auto-released after 30s. After a few days it still exists, so I guess the oldest mixcoord has not actually exited.

ian-lss commented 6 months ago

@weiliu1031

It's true that there was a pod with that IP. However, as shown in the attached image, it's a pod that was terminated on March 19th. This means that the pod didn't exist at the time of the upgrade. How did the rootcoord know to connect, and how could it possibly leave a log indicating that the connection was successful?

screenshot 2024-03-26 3:20:45 PM

weiliu1031 commented 6 months ago

@weiliu1031

It's true that there was a pod with that IP. However, as shown in the attached image, it's a pod that was terminated on March 19th. This means that the pod didn't exist at the time of the upgrade. How did the rootcoord know to connect, and how could it possibly leave a log indicating that the connection was successful?

screenshot 2024-03-26 3:20:45 PM

Milvus implements sessions on top of etcd leases, and there is a keepalive roughly every 30s. The latest rootcoord gets all sessions from etcd and found that the oldest coord still holds the active session, which means the oldest mixcoord has not exited.
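
Conceptually, a Milvus session is just an etcd key bound to a lease that the owning process keeps alive. The sketch below (an illustration only, using etcd's Go client; the endpoint, key, and value are made up and this is not the actual sessionutil code) shows the mechanism:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical endpoint and key names; Milvus' real session code lives in sessionutil.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"ai-etcd.ai-milvus:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// A session is essentially a key bound to a lease with a short TTL (~30s in Milvus).
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		panic(err)
	}
	if _, err := cli.Put(ctx, "session/rootcoord", "10.4.31.99:53100", clientv3.WithLease(lease.ID)); err != nil {
		panic(err)
	}

	// While the process is alive it keeps refreshing the lease. If the process really exits,
	// the lease expires and the session key disappears within roughly one TTL. A session that
	// survives for days therefore implies the old process is still running somewhere.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		panic(err)
	}
	for resp := range ch {
		fmt.Println("session lease refreshed, TTL:", resp.TTL)
	}
}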

haorenfsa commented 6 months ago

@ian-lss did you use kubectl delete --force to delete that pod? In that case it's possible that the container is still running on the worker node, holding the active session without being detected.

haorenfsa commented 6 months ago

Could you log in to that node and check whether the mixcoord container is still running?

ian-lss commented 6 months ago

@weiliu1031 @haorenfsa

If my memory serves correctly, this is a pod that was force deleted for the upgrade to version 2.3.11, before the upgrade to 2.3.12. Could this be the cause? If so, does this mean that mixcoord should not force delete terminating pods? The reason for the force delete was because, similarly, the mixcoord pod did not terminate due to a previous upgrade and remained in a terminating state.

As mentioned above, after 3/19 the pod did not physically exist; the upgrade was on 3/21.

weiliu1031 commented 6 months ago

@weiliu1031 @haorenfsa

If my memory serves correctly, this is a pod that was force deleted for the upgrade to version 2.3.11, before the upgrade to 2.3.12. Could this be the cause? If so, does this mean that mixcoord should not force delete terminating pods? The reason for the force delete was because, similarly, the mixcoord pod did not terminate due to a previous upgrade and remained in a terminating state.

As mentioned above, after 3/19 the pod did not physically exist; the upgrade was on 3/21.

Yes, the force delete caused the oldest querycoord to keep existing until now. You may need to log in to the node to clean up the unexpected pod, which is actually acting as the active mixcoord now.

ian-lss commented 6 months ago

@weiliu1031

What do you mean by "you may need to log in to the node to clean up the unexpected pod"?

The initial reason we had to force delete was due to a GKE issue that resulted in one of the mixcoords being deleted and recreated (with its IP changed). Since then, terminating pods have started to appear with every upgrade. This situation seems like it could occur frequently during operations. Does this mean that from this point on, there will inevitably be terminating pods with every upgrade, implying that a regular cleanup will always be necessary?

weiliu1031 commented 6 months ago

@weiliu1031

What do you mean by "you may need to log in to the node to clean up the unexpected pod"?

The initial reason we had to force delete was due to a GKE issue that resulted in one of the mixcoords being deleted and recreated (with its IP changed). Since then, terminating pods have started to appear with every upgrade. This situation seems like it could occur frequently during operations. Does this mean that from this point on, there will inevitably be terminating pods with every upgrade, implying that a regular cleanup will always be necessary?

The answer is no. The kubectl delete --force on mixcoord left behind an unexpected mixcoord process, which has held the session until now; just remove it, and future rolling upgrades will go back to normal.

ian-lss commented 6 months ago

@weiliu1031

I don't understand what you mean by "remove the session". What session should be removed, and how?

weiliu1031 commented 6 months ago

@weiliu1031

I don't understand what you mean by "remove the session". What session should be removed, and how?

The oldest mixcoord, with address 10.4.31.99:53100, is still alive on the node; you need to shut it down. You executed kubectl delete --force on this pod, but the container is still running on the worker node, holding the active session without being detected.

ian-lss commented 6 months ago

@weiliu1031 I keep informing you that the pod with IP 10.4.31.99 was physically deleted two days before the upgrade, and it no longer exists. That's why I was wondering what needs to be removed.

weiliu1031 commented 6 months ago

@weiliu1031 I keep informing you that the pod with IP 10.4.31.99 was physically deleted two days before the upgrade, and it no longer exists. That's why I was wondering what needs to be removed.

When you execute kubectl delete --force, the pod is removed from the Kubernetes API server, but the container can still be alive on the node and will no longer be tracked by Kubernetes.

haorenfsa commented 6 months ago

If my memory serves correctly, this is a pod that was force deleted for the upgrade to version 2.3.11, before the upgrade to 2.3.12. Could this be the cause?

Yes, it's the cause

If so, does this mean that mixcoord should not force delete terminating pods?

Yes. A force delete only stops Kubernetes from tracking the pod's lifecycle; it does not actually guarantee the deletion of the pod on the machine.

The reason for the force delete was because, similarly, the mixcoord pod did not terminate due to a previous upgrade and remained in a terminating state.

In this case, if you want it to end in a short time you could either: kubectl delete $podName --grace-period=1 or kubectl exec -it $podName -- bash -c 'pid=$(ps -C milvus -o pid=); kill -9 $pid'

haorenfsa commented 6 months ago

I keep informing you that the pod with IP 10.4.31.99 was physically deleted two days before the upgrade, and it no longer exists. That's why I was wondering what needs to be removed.

As described above, 10.4.31.99 has only logically disappeared from the Kubernetes control plane; the container may still be running on the worker node machine.

You can log in to the machine, if possible, to verify this theory. If it's a cloud provider's VM or something similar, try taking it offline or rebooting it.

ian-lss commented 6 months ago

@haorenfsa @weiliu1031

Thank you. I now completely understand the phenomenon you explained, and I also understand why the display-error issue (now closed) occurred. I remember that we performed a force delete during the upgrade process, though not from the beginning, but I'm not entirely certain of my memory. I will therefore test a few pod-termination scenarios in GKE and let you know.

ian-lss commented 6 months ago

@haorenfsa @weiliu1031

When attempting to delete and recreate one of the mixcoord pods by running the kubectl delete command, a new pod is created, and the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord. In the current situation, since the active mixcoord is in service, we cannot resolve the terminating situation by deleting or restarting the active mixcoord, similar to what we would do with a force delete. How should a general mixcoord pod deletion be handled?

It seems like we began to force delete from this point.

haorenfsa commented 6 months ago

When attempting to delete and recreate one of the mixcoord pods by running the kubectl delete command, a new pod is created, and the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord.

@weiliu1031 This behavior is not as expected; we expect the terminating mixcoord pod to exit on its own.

weiliu1031 commented 6 months ago

@haorenfsa @weiliu1031

When attempting to delete and recreate one of the mixcoord pods by running the kubectl delete command, a new pod is created, and the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord. In the current situation, since the active mixcoord is in service, we cannot resolve the terminating situation by deleting or restarting the active mixcoord, similar to what we would do with a force delete. How should a general mixcoord pod deletion be handled?

It seems like we began to force delete from this point.

What do you mean by "the previous pod remains in the Terminating state without terminating because it is connected to the active mixcoord"? Why would the old mixcoord be connected to the active mixcoord? Please upload the mixcoord logs.

ian-lss commented 6 months ago

@weiliu1031 @haorenfsa

I created a completely new cluster and set up two replicas of mixcoord. Afterwards, I ran a delete command on the pod corresponding to mixcoord-terminating_old_pod2.txt, and the pod for mixcoord-newpod2.txt was newly created. Nevertheless, mixcoord-terminating_old_pod2 remains in a terminating state. Among the contents of mixcoord-terminating_old_pod2.txt there is ["RootCoordClient GetSessions success"] [address=10.3.22.115:53100]; this IP belongs to mixcoord-pod1. Considering what you said earlier about force deleting, I thought that session might be the reason why mixcoord-terminating_old_pod2 is still in a terminating state. So, my question was, "How do you delete a pod?"

mixcoord-pod1.txt mixcoord-terminating_old_pod2.txt mixcoord-newpod2.txt

haorenfsa commented 6 months ago

@ian-lss

So, my question was, "How do you delete a pod?"

As I mentioned above:

In this case, if you want it to end in a short time you could either: kubectl delete $podName --grace-period=1 or kubectl exec -it $podName -- bash -c 'pid=$(ps -C milvus -o pid=); kill -9 $pid'

ian-lss commented 6 months ago

@haorenfsa

Right. I haven't tried it yet, but I figured it would work with kubectl delete $podName --grace-period=1 or by killing the process. That said, within the operator the TerminationGracePeriodSeconds value is always set to one month, and I'm wondering why it was set to such a lengthy duration. From the comments alone, the choice of a one-month period doesn't make sense to me.

// oneMonthSeconds we set both podtemplate.spec.terminationGracePeriodSeconds &
// deployment.spec.progressDeadlineSeconds to one month, to avoid kill -9 on pod automatically.
// so that when pod stuck on rolling, the service will still be available.
// We'll have enough time to find the root cause and handle it gracefully.

Personally, even one week seems too long. Can't we make it possible for users to specify this duration in the config?

haorenfsa commented 6 months ago

but I'm wondering why it was set to such a lengthy duration. From the comments alone, the choice of a one-month period doesn't make sense to me.

The mixcoord is expected to exit by itself within a couple of minutes, so in most cases the terminationGracePeriodSeconds really doesn't matter.

The other case is that it fails to exit by itself, as you encountered. In that case we're sure something unexpected has happened, so a human must be involved to handle it. If we set the value too short, the pod will be killed before anyone gets a chance to react, and the service is severely affected when a pod gets killed.

So we never want the kill to happen automatically. Since the grace period can effectively only be decreased, not increased, we set it to an almost infinite value to make sure there is enough time to react; it leaves more degrees of freedom.

image

If you find a pod stuck in terminating and you're sure you can accept the downtime of killing it, you can always kill it at whatever time you think is appropriate.

Personally, even one week seems too long. Can't we make it possible for users to specify this duration in the config?

If you make a patch to add a config field, we will merge it. But I don't think it's worth the time: if the mixcoord gets stuck, it's a bug, and we should fix it in Milvus.

ian-lss commented 4 months ago

Hello @weiliu1031, I'm wondering about the status of this issue. I upgraded from 2.3.12 to 2.4.4 using milvus-operator 0.9.15, but the same issue is still occurring. There are no problems when applying spec or config changes other than the upgrade. cc: @haorenfsa

haorenfsa commented 3 months ago

Hi @ian-lss, as far as I know, it's been completely solved in 2.3.16+ or 2.4.3+. @weiliu1031 any comments?

haorenfsa commented 3 months ago

@ian-lss Since you upgraded from 2.3.12, it's expected that you may still run into this case; it should be fine afterwards. By the way, it now makes sense to me to make terminationGracePeriodSeconds configurable, since I'm not fully confident this issue will never occur again in Milvus. It will be available in the next version.

ian-lss commented 3 months ago

@haorenfsa Thank you for your response. I will proceed with the upgrade in the next version and let you know the results.

ian-lss commented 3 months ago

@haorenfsa I upgraded to version 2.4.5 this time, and as you mentioned, the mixcoord pod issue has been cleanly resolved. Thank you for your help throughout.