zilliztech / milvus-operator

The Kubernetes Operator of Milvus.
https://milvus.io
Apache License 2.0

Support Host Network Mode for Pods? #134

Closed qchenzi closed 4 days ago

qchenzi commented 2 weeks ago

Description

Milvus is often used for real-time data processing and large-scale vector similarity search, both of which require high throughput and low latency. Supporting host network mode at the Pod level can reduce network latency by eliminating container-networking overhead, which is crucial for performance-sensitive applications.

Proposed Solution

Introduce an option in the Milvus Operator configuration to enable host network mode for specific components, such as the milvus-proxy, allowing users to opt-in based on their performance needs.

Benefits

- Reduced Latency: Direct access to the host's network stack can significantly lower network latency.
- Improved Throughput: Enhanced network performance can increase query-handling capacity.
- Flexibility: Users can optimize deployments based on their specific requirements.

Supporting host network mode for specific components would greatly benefit real-time data processing applications in Milvus.

Thank you for considering this request.

haorenfsa commented 2 weeks ago

Looks good to me. @qchenzi would you like to add it yourself?

qchenzi commented 2 weeks ago

Sure, I can try to add it myself, but I'm not entirely clear on the related logic or where to start. Could you please provide some guidance, particularly on which part of the Milvus Operator configuration I should focus on and which components or files are critical for implementing host network mode? Your help would be greatly appreciated. Thank you!

haorenfsa commented 1 week ago

If I understand the feature correctly, we should add fields to the Milvus CR to adjust spec.hostNetwork & spec.dnsPolicy in the pod template. The final pod manifest should look like the one below:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  # containers are required for a valid Pod; a minimal example:
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]

ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

First, we need to add hostNetwork & dnsPolicy fields to the ComponentSpec struct located in https://github.com/zilliztech/milvus-operator/blob/592de3d6e6f7b7a43fbd11c5f2e99d211b036767/apis/milvus.io/v1beta1/components_types.go#L33, and then run generate-all to update the CRD manifests & the structs' deep-copy functions.
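A sketch of what those fields might look like (field names, types, and markers here are assumptions, not the final implementation):

package v1beta1

import corev1 "k8s.io/api/core/v1"

// ComponentSpec sketch: only the two new fields are shown; the real struct
// already has many more fields.
type ComponentSpec struct {
	// HostNetwork, if true, runs the component's pods in the host's
	// network namespace.
	// +optional
	HostNetwork bool `json:"hostNetwork,omitempty"`

	// DNSPolicy sets the pod DNS policy. ClusterFirstWithHostNet is the
	// usual choice alongside hostNetwork, so in-cluster DNS keeps working.
	// +optional
	DNSPolicy corev1.DNSPolicy `json:"dnsPolicy,omitempty"`
}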

By doing this, we effectively add configuration fields covering the two cases below:

# Case 1: configure all components to hostnetwork
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
  labels:
    app: milvus
spec:
  components:
    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet
# Case 2: configure some of the components to hostnetwork
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
  labels:
    app: milvus
spec:
  components:
    proxy:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
    mixcoord:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet

Then we need to add render logic that applies these fields to the deployment's pod template (deploy.spec.template) in https://github.com/zilliztech/milvus-operator/blob/592de3d6e6f7b7a43fbd11c5f2e99d211b036767/pkg/controllers/deployment_updater.go#L80

Then it's also required to add unit tests in https://github.com/zilliztech/milvus-operator/blob/main/pkg/controllers/deployment_updater_test.go to verify that the render logic works; a rough sketch follows below.
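For instance, something along these lines (mockUpdaterWithNetwork is a hypothetical helper; a real test should reuse the deploymentUpdater fixtures that already exist in that file):

package controllers

import (
	"testing"

	"github.com/stretchr/testify/assert"
	corev1 "k8s.io/api/core/v1"
)

// Verifies that hostNetwork and dnsPolicy from the merged component spec are
// copied into the pod template by the render logic.
func TestUpdateNetworkSettings(t *testing.T) {
	template := &corev1.PodTemplateSpec{}
	// mockUpdaterWithNetwork is a hypothetical stand-in returning a
	// deploymentUpdater whose merged spec has the given network settings.
	updater := mockUpdaterWithNetwork(true, corev1.DNSClusterFirstWithHostNet)
	updateNetworkSettings(template, updater)
	assert.True(t, template.Spec.HostNetwork)
	assert.Equal(t, corev1.DNSClusterFirstWithHostNet, template.Spec.DNSPolicy)
}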

qchenzi commented 1 week ago

Hi @haorenfsa

I have submitted a pull request that supports host network mode per component: https://github.com/zilliztech/milvus-operator/pull/141

After implementing the changes, the result looks like the attached image:

image

And we can add configuration fields like this:

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
  labels:
    app: milvus
spec:
  components:
    image: "milvusdb/milvus:v2.4.4"
    proxy:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet

Please review the pull request at your earliest convenience.

qchenzi commented 1 week ago

Hi @haorenfsa

I ran into some new issues after implementing #141, and they have me confused.

Although the replicas for all components are set to 1, I observe that during the initialization phase the number of pods increases to 2 for each component, as in the attached image below:

image

This eventually resolves itself and the number of pods returns to the expected count of 1, but I have also noticed that there are still two deployments for querynode, as in the image below:

image

Could you please provide any insight into why these issues might be occurring and how to address them?

However, after modifying the updateNetworkSettings function as shown below, these issues appear to be resolved:

Updated function:

func updateNetworkSettings(template *corev1.PodTemplateSpec, updater deploymentUpdater) {
    mergedComSpec := updater.GetMergedComponentSpec()
    template.Spec.HostNetwork = mergedComSpec.HostNetwork

    // Only override the pod's DNS policy when one is explicitly configured;
    // otherwise leave the template's DNSPolicy untouched.
    if len(mergedComSpec.DNSPolicy) > 0 {
        logf.Log.Info("update dns policy", "dnsPolicy", mergedComSpec.DNSPolicy, "component", updater.GetComponentName())
        template.Spec.DNSPolicy = mergedComSpec.DNSPolicy
    }
}

Original function:

func updateNetworkSettings(template *corev1.PodTemplateSpec, updater deploymentUpdater) {
    mergedComSpec := updater.GetMergedComponentSpec()
    template.Spec.HostNetwork = mergedComSpec.HostNetwork
    template.Spec.DNSPolicy = mergedComSpec.DNSPolicy
}

It seems that the conditional check for DNSPolicy has addressed the issue, but I'm not entirely sure why this change resolved it.

image

haorenfsa commented 4 days ago

One more thing to note when enabling this feature: every Milvus pod uses port 9091, so it's advised to add anti-affinity for Milvus pods that enable hostNetwork. Otherwise a restart or scale-out may fail, because two pods scheduled onto the same worker node would conflict on that port.
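A sketch of what that could look like in the CR, assuming the components spec exposes the standard Kubernetes affinity field and that the pods carry the app.kubernetes.io/instance label (verify both against your deployment):

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
spec:
  components:
    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet
    # Keep hostNetwork pods of this release on distinct nodes to avoid
    # port 9091 conflicts. The label key/value are assumptions; check the
    # labels actually set on your Milvus pods.
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: my-release
          topologyKey: kubernetes.io/hostname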