zabbix-community / helm-zabbix

Helm chart for Zabbix
https://artifacthub.io/packages/helm/zabbix-community/zabbix
Apache License 2.0

[zabbix-community/zabbix] Zabbix agent does not work properly in active mode when the Zabbix server has multiple pods #65

Closed: pirateunclejack closed this issue 4 months ago

pirateunclejack commented 5 months ago

Describe the bug: As described here: https://www.zabbix.com/forum/zabbix-help/444411-advice-about-running-zabbix-server-inside-a-kubernetes-cluster

When the Zabbix server has multiple pods and the Zabbix agent works in active mode, the Zabbix agent cannot communicate with the Zabbix server normally.

Version of Helm and Kubernetes:
Helm: 4.0.2
Kubernetes: v1.22.0
Zabbix server: 6.4.10
Zabbix agent2: 6.4.9-1+ubuntu20.04

What happened: zabbix-agent2 log:

2024/01/31 17:12:38.126297 [101] cannot connect to [SERVERIP:PORT]: dial tcp :0->SERVERIP:PORT: connect: connection refused
2024/01/31 17:12:38.126408 [101] active check configuration update from host [NODENAME] started to fail
2024/01/31 17:12:44.122352 [101] active check configuration update from [SERVERIP:PORT] is working again

Data is lost when zabbix-server runs in HA mode (after working normally for about 1 hour, data from all nodes, including the Zabbix server itself, is lost): (screenshot attached)

What you expected to happen: zabbix-agent2 should communicate with the zabbix server properly when the zabbix server has multiple pods. There should be no connection errors in the zabbix-agent2 log, and no monitoring data should be lost.

How to reproduce it (as minimally and precisely as possible): Set up the Zabbix server on Kubernetes with 2 pods and add a zabbix-agent2 host working in active mode.

Anything else we need to know: I modified some files in the Helm chart, but that should not be the cause of this problem.

aeciopires commented 4 months ago

Hi @pirateunclejack!

Sorry for the delay in my response. When we use Kubernetes, communication involving Zabbix components must not use an IP address, as the pod's IP is ephemeral. Every time a pod is recreated, a different IP is assigned. It is unfeasible to keep changing IPs in the registration of hosts and/or configuration files. The correct way is to use DNS names.

In Kubernetes we have 3 types of service: ClusterIP, NodePort and LoadBalancer. See more details and the differences between them in the Kubernetes documentation.

If the target host, which has zabbix-agentd installed, is in the same Kubernetes cluster, you can use the DNS name of the ClusterIP service used by the Zabbix Server and configure zabbix-agentd to send metrics to it. The ClusterIP service DNS name has the following format:

SERVICE_NAME.NAMESPACE.svc.cluster.local:PORT

Example:

zabbix-zabbix-server.monitoring.svc.cluster.local:10051
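
For reference, the SERVICE_NAME and NAMESPACE parts come straight from the Service object's metadata. A minimal sketch of such a Service (the names below are just the example above, not necessarily what your release creates):

apiVersion: v1
kind: Service
metadata:
  name: zabbix-zabbix-server   # SERVICE_NAME part of the DNS name
  namespace: monitoring        # NAMESPACE part of the DNS name
spec:
  type: ClusterIP
  selector:
    app: zabbix-zabbix-server  # illustrative selector; the chart sets its own labels
  ports:
    - name: zabbix-trapper
      port: 10051
      targetPort: 10051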

The name, type and IP of the services can be viewed using the command:

kubectl get service -n NAMESPACE

Example:

kubectl get service -n monitoring

The reverse is also true: if the Zabbix Server needs to talk to the Zabbix Agentd and they are in the same cluster, the DNS name of the Zabbix Agentd ClusterIP service has the same format.

Example:

zabbix-zabbix-agentd.mynamespace-example.svc.cluster.local:10050

Now, if they are on different clusters, networks or hosts, it will be necessary to expose the Zabbix Server through a LoadBalancer, assign a DNS name on the DNS server pointing to the load balancer IP, configure the network routes and open the required firewall permissions.
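
For example, a minimal sketch of the values to expose the Zabbix Server this way (check the exact keys against the values.yaml of your chart version):

zabbixServer:
  service:
    # Expose the Zabbix Server trapper port through a load balancer; point a DNS
    # record at the address it receives and use that name in ServerActive on the agents.
    type: LoadBalancer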

pirateunclejack commented 4 months ago

Hi @aeciopires, thank you for such a detailed reply!

In my case the Zabbix server deployment is in a different network from the zabbix-agent. In fact, the SERVERIP in the log is the IP of the load balancer. We do have a DNS name on the DNS server pointing to the load balancer IP.

I think the problem is: the load balancer or Kubernetes CANNOT identify which zabbix server pod is the active one. When a request from the zabbix client is balanced to a standby zabbix server pod, it fails.

Do you agree?

aeciopires commented 4 months ago

Hi @pirateunclejack!

Thanks for adding this detail. I can understand better now. You will probably need to adjust some configuration of the container network interface (CNI) in your Kubernetes cluster to preserve the source IP (in this case, the IP of the zabbix agent).


Traffic Policies

Your load balancer must understand and respect the service's externalTrafficPolicy option; it implements different announcement modes depending on the policy and announcement protocol you select.

Layer2

When announcing in layer2 mode, one node in your cluster will attract traffic for the service IP. From there, the behavior depends on the selected traffic policy.

“Cluster” traffic policy

With the default Cluster traffic policy, kube-proxy on the node that received the traffic does load balancing, and distributes the traffic to all the pods in your service.

This policy results in uniform traffic distribution across all pods in the service. However, kube-proxy will obscure the source IP address of the connection when it does load balancing, so your pod logs will show that external traffic appears to be coming from the service’s leader node.

“Local” traffic policy

With the Local traffic policy, kube-proxy on the node that received the traffic sends it only to the service’s pod(s) that are on the same node. There is no “horizontal” traffic flow between nodes.

Because kube-proxy doesn’t need to send traffic between cluster nodes, your pods can see the real source IP address of incoming connections.

The downside of this policy is that incoming traffic only goes to some pods in the service. Pods that aren't on the current leader node receive no traffic; they are just there as replicas in case a failover is needed.


I think this is what you want, right?

In the helm-zabbix chart, the externalTrafficPolicy option is commented out by default. You can uncomment this option and set the value for each Zabbix component.
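
For example (a sketch only; compare the exact key location with your values.yaml, as it may differ between chart versions):

zabbixServer:
  service:
    type: LoadBalancer
    # Keep traffic on the node that received it and preserve the client source IP
    externalTrafficPolicy: Local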


pirateunclejack commented 4 months ago

Hi @aeciopires, thank you for such a quick reply.

Unfortunately, switching externalTrafficPolicy from Cluster to Local does not solve the problem. With this change the zabbix server cannot receive monitoring data from the zabbix agent at all, no matter whether the zabbix server works in standalone mode with only one pod or in HA mode with two pods.

I am sorry, I provided wrong information earlier. We are not using a load balancer in the Kubernetes cluster. The type of the zabbix server service is NodePort, and the SERVERIP in the zabbix agent log is the IP of the nodes (we have three) that the Kubernetes ingress controllers run on.

Do you think this is the reason?

aeciopires commented 4 months ago

Hi @pirateunclejack!

Hum... this new information changes the context.

I think this is the problem. I found this post explaining exactly your context:

https://stackoverflow.com/questions/60067188/how-do-kubernetes-nodeport-services-with-service-spec-externaltrafficpolicy-loca

I think you need to use a LoadBalancer type service to expose the Zabbix Server.

If you use Kubernetes in an on-premises environment, you can install MetalLB as a Loadbalancer manager.
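
For reference, a minimal MetalLB layer 2 configuration looks roughly like this (MetalLB >= 0.13 CRD syntax; the names and the address range are placeholders for values from your own network):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: zabbix-pool            # placeholder name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.240-192.168.10.250   # placeholder range of addresses you control
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: zabbix-l2              # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
    - zabbix-pool

With that in place, Services of type LoadBalancer get an address from the pool automatically.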

If you use Kubernetes on a cloud provider, you must use the provider's cloud controller to create the load balancer:

AWS: ALB (Application Load Balancer)
GCP
Azure
Oracle Cloud

This way you can set the externalTrafficPolicy option correctly.

Either way, using a NodePort service in this scenario is a bad idea. NodePort services are recommended for testing environments only, because when a node fails it is necessary to change the IP address on all Zabbix Agents/Proxies to redirect traffic to another node's IP. This is not scalable or sustainable in a production environment.

I think the root cause of the problem is more related to the Kubernetes configuration and less related to this helm chart or Zabbix.

pirateunclejack commented 4 months ago

Hi @aeciopires Thanks for your reply.

We are using KubeSphere. Although I don't know how they do it, the NodePort keeps working on the ingress controller nodes even when the zabbix server pod switches to another node.

Currently, it is not possible to deploy a load balancer in our Kubernetes cluster. I will deploy a test Kubernetes environment and try it there.

Thank you very much.

aeciopires commented 4 months ago

Hi @pirateunclejack!

Thanks for sharing this detail. I thought you were using vanilla Kubernetes. I have not used https://github.com/kubesphere/kubesphere myself.

Good luck with your new tests.

Can we close this issue?

If you need help, you can open a discussion at https://github.com/zabbix-community/helm-zabbix/discussions (a new GitHub feature) instead of opening an issue. It's like a community forum.

leighwgordon commented 4 months ago

FYI, I was looking at this same issue a few weeks ago, and one possible solution is to configure a readinessProbe which only succeeds if the pod is running the active Zabbix server. You could test the port, but I thought it was a stronger guarantee to use an exec probe and go straight to the horse's mouth: ask Zabbix itself whether it is active or not.

It's possible to get this information from the runtime control ha_status, which will show all servers in the cluster, and their statuses (active/standby/stopped etc.): https://www.zabbix.com/documentation/current/en/manpages/zabbix_server

I don't have a complete working example, but I was testing with 2 server replicas, with the zabbixServer service type set to LoadBalancer and applying this patch to the zabbix-server Deployment:

zabbixServer:
  service:
    externalTrafficPolicy: Local
    type: LoadBalancer

---
- op: add
  path: /spec/template/spec/containers/0/readinessProbe
  value:
    exec:
      # Exec probes are not run through a shell, so wrap the check in `sh -c`
      # to get command substitution and $ZBX_NODEADDRESS expansion.
      command:
        - sh
        - -c
        - test "$(zabbix_server -R ha_status | awk /$ZBX_NODEADDRESS:10051/'{print $5}')" = active
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

I shelved it for a while because it doesn't play nicely with Argo CD (it leaves the deployment in a permanently degraded state in Argo CD terms, despite the unready pod being desirable in this case) but it may be worth a try in your scenario. Aside from that, it was promising... with the LoadBalancer only ever forwarding traffic to the active server pod.

This scenario is perhaps a bit too opinionated and specific to be much of a chart issue, except perhaps a generic feature allowing the readinessProbe to be configurable through the chart values/server template to avoid having to add patches (similar to how it is already configurable for the web deployment, except that one is hard-coded to the httpGet probe type).
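
Something along these lines in values.yaml, passed through verbatim to the server container, would cover it (a hypothetical structure, not something the chart exposes today):

zabbixServer:
  # Hypothetical key: rendered as-is into the zabbix-server container spec if set
  readinessProbe:
    exec:
      command:
        - sh
        - -c
        - test "$(zabbix_server -R ha_status | awk /$ZBX_NODEADDRESS:10051/'{print $5}')" = active
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 6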

aeciopires commented 4 months ago

Hi @leighwgordon!

I implemented your suggestion in this PR: https://github.com/zabbix-community/helm-zabbix/pull/69
Can you test it?