Hi @pirateunclejack!
Sorry for the delay in my response. When we use Kubernetes, communication involving Zabbix components must not use an IP address, because a pod's IP is ephemeral: every time a pod is recreated, a different IP is assigned. It is infeasible to keep changing IPs in the host registrations and/or configuration files. The correct way is to use DNS names.
In Kubernetes we have three main Service types: ClusterIP, NodePort, and LoadBalancer. See more details and differences here:
If the target host, which has zabbix-agentd installed, is in the same Kubernetes cluster, you can use the DNS name of the ClusterIP service used by the Zabbix Server and configure the Zabbix agent to send metrics to it. The default ClusterIP DNS name has the following format:
SERVICE_NAME.NAMESPACE.svc.cluster.local:PORT
Example:
zabbix-zabbix-server.monitoring.svc.cluster.local:10051
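For instance, a zabbix_agentd.conf fragment pointing the agent at that service name could look like this (a sketch; the service name is the example above, and your release name and namespace may differ):

```ini
# zabbix_agentd.conf (fragment) -- names are illustrative
# Passive checks: which server is allowed to connect to this agent
Server=zabbix-zabbix-server.monitoring.svc.cluster.local
# Active checks: where the agent sends its data
ServerActive=zabbix-zabbix-server.monitoring.svc.cluster.local:10051
```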
The name, type and IP of the services can be viewed using the command:
kubectl get service -n NAMESPACE
Example:
kubectl get service -n monitoring
The reverse is also true: if the Zabbix Server needs to talk to the Zabbix agent and they are in the same cluster, the DNS name of the Zabbix agent's ClusterIP service follows the same format.
Example:
zabbix-zabbix-agentd.mynamespace-example.svc.cluster.local:10050
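The pattern is the same in both directions; as a trivial illustration of how these names are assembled (values taken from the example above):

```shell
# Assemble the in-cluster DNS name of a Service from its parts.
SERVICE_NAME=zabbix-zabbix-agentd
NAMESPACE=mynamespace-example
PORT=10050
echo "${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local:${PORT}"
# → zabbix-zabbix-agentd.mynamespace-example.svc.cluster.local:10050
```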
Now, if they are on different clusters, networks, or hosts, it will be necessary to configure a LoadBalancer service to expose the Zabbix Server, assign a DNS name on the DNS server pointing to the load balancer IP, configure the network route, and open the required permissions on the firewall.
Hi @aeciopires Thank you for such a detailed reply!
In my case, the Zabbix server deployment is in a different network from the zabbix-agent.
In fact, the SERVERIP
in the log is the IP of the load balancer. We do have a DNS name on the DNS server pointing to the load balancer IP.
I think the problem is: the load balancer or Kubernetes CANNOT identify which Zabbix server pod is active to provide service. When a request from the Zabbix client is balanced to a standby Zabbix server pod, it fails.
Do you agree?
Hi @pirateunclejack!
Thanks for adding this detail. I understand the situation better now. You will probably need to adjust some configuration in the Container Network Interface (CNI) of your Kubernetes cluster to preserve the source IP (in this case, the Zabbix agent's).
Your load balancer must understand and respect the service's externalTrafficPolicy option, and it implements different announcement modes depending on the policy and announcement protocol you select.
When announcing in layer2 mode, one node in your cluster will attract traffic for the service IP. From there, the behavior depends on the selected traffic policy.
With the default Cluster traffic policy, kube-proxy on the node that received the traffic does load balancing, and distributes the traffic to all the pods in your service.
This policy results in uniform traffic distribution across all pods in the service. However, kube-proxy obscures the source IP address of the connection when it does load balancing, so your pod logs will show external traffic as coming from the service's leader node.
With the Local traffic policy, kube-proxy on the node that received the traffic sends it only to the service’s pod(s) that are on the same node. There is no “horizontal” traffic flow between nodes.
Because kube-proxy doesn’t need to send traffic between cluster nodes, your pods can see the real source IP address of incoming connections.
The downside of this policy is that incoming traffic only goes to some pods in the service. Pods that aren't on the current leader node receive no traffic; they are just there as replicas in case a failover is needed.
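As an illustration, a Service using the Local policy might look like this (a sketch; the name, selector, and port are assumptions for illustration, not taken from the chart):

```yaml
# Sketch: a Service that preserves client source IPs via the Local policy.
apiVersion: v1
kind: Service
metadata:
  name: zabbix-server   # illustrative name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # traffic only reaches pods on the receiving node
  selector:
    app: zabbix-server           # assumed pod label
  ports:
    - port: 10051
      targetPort: 10051
```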
I think this is what you want, right?
In the helm-zabbix chart, the externalTrafficPolicy
option is commented out by default. You can uncomment this option and set the value for each Zabbix component.
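For example, a values.yaml override for the server component might look like this (a sketch; the exact key paths should be checked against the chart's default values):

```yaml
# values.yaml (fragment) -- enable a LoadBalancer with source-IP preservation
zabbixServer:
  service:
    type: LoadBalancer
    externalTrafficPolicy: Local
```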
References:
Hi @aeciopires Thank you for such a quick reply.
Unfortunately, switching externalTrafficPolicy
from Cluster
to Local
does not solve the problem.
With this change, the Zabbix server cannot receive monitoring data from the Zabbix agent at all, whether the Zabbix server is working in standalone mode with only one pod or in HA mode with two pods.
I am sorry, I provided wrong information earlier. We are not using a load balancer in the Kubernetes cluster.
The type of the Zabbix server service is NodePort, and the SERVERIP
in the Zabbix agent log is the IP of the nodes (we have three) on which the Kubernetes ingress controllers are running.
Do you think this is the reason?
Hi @pirateunclejack!
Hmm... this new information changes the context.
I think this is the problem. I found this post explaining exactly your context:
I think you need to use a LoadBalancer type service to expose the Zabbix Server.
If you use Kubernetes in an on-premises environment, you can install MetalLB as a load balancer manager.
If you use Kubernetes on a cloud provider, you must use the cloud controller to create the load balancer.
AWS: ALB (Application Load Balancer):
GCP:
Azure:
Oracle Cloud:
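For the on-premises/MetalLB route, a minimal layer-2 configuration could look like this (a sketch using MetalLB's v1beta1 CRDs; the pool name and address range are assumptions you must adapt to your network):

```yaml
# Sketch: minimal MetalLB layer-2 setup -- addresses are an assumed example range
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: zabbix-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.240-192.168.10.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: zabbix-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - zabbix-pool
```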
This way you can set the externalTrafficPolicy option correctly.
Either way, using the NodePort service in this scenario is a bad idea. The NodePort service is recommended for testing environments only: when a node fails, it is necessary to change the IP address on all Zabbix agents/proxies to redirect traffic to another node's IP. This is not scalable or sustainable in a production environment.
I think the root cause of the problem is more related to the Kubernetes configuration and less related to this helm chart or Zabbix.
Hi @aeciopires Thanks for your reply.
We are using KubeSphere. Although I don't know how they do it, the NodePort keeps working on the ingress controller nodes even when the Zabbix server pod switches nodes.
Currently, it is not possible to deploy a load balancer in our Kubernetes cluster. I will deploy a test Kubernetes environment and test there.
Thank you very much.
Hi @pirateunclejack!
Thanks for sharing this detail. I thought you were using vanilla Kubernetes. I haven't used https://github.com/kubesphere/kubesphere
Good luck with your new tests.
Can we solve this issue?
If you need help, you can open a discussion here https://github.com/zabbix-community/helm-zabbix/discussions (new Github feature) instead of opening an issue. It's like a community forum.
FYI, I was looking at this same issue a few weeks ago, and one possible solution is to configure a readinessProbe which only returns true if the pod is running the active Zabbix server. You could test the port, but I thought it was a stronger guarantee to use an exec probe and go straight to the horse's mouth, asking Zabbix itself whether it is active or not.
It's possible to get this information from the runtime control command ha_status, which will show all servers in the cluster and their statuses (active/standby/stopped, etc.): https://www.zabbix.com/documentation/current/en/manpages/zabbix_server
I don't have a complete working example, but I was testing with 2 server replicas, with the zabbixServer service type set to LoadBalancer:

```yaml
zabbixServer:
  service:
    externalTrafficPolicy: Local
    type: LoadBalancer
```

and applying this patch to the zabbix-server Deployment (note: the probe command is wrapped in `sh -c`, because exec probes do not perform shell expansion of `$(...)` or environment variables on their own):

```yaml
- op: add
  path: /spec/template/spec/containers/0/readinessProbe
  value:
    exec:
      command:
        - /bin/sh
        - -c
        - test "$(zabbix_server -R ha_status | awk /$ZBX_NODEADDRESS:10051/'{print $5}')" = active
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
```
I shelved it for a while because it doesn't play nicely with Argo CD (it leaves the deployment in a permanently degraded state in Argo CD's terms, even though the unready pod is desirable in this case), but it may be worth a try in your scenario. Aside from that, it was promising, with the LoadBalancer only ever forwarding traffic to the active server pod.
This scenario is perhaps a bit too opinionated and specific to be much of a chart issue, except perhaps as a generic feature allowing the readinessProbe to be configured through the chart values/server template, to avoid having to add patches (similar to how it is already configurable for the web deployment, except that one is hard-coded to the httpGet
probe type).
Hi @leighwgordon!
I implemented your suggestion in this PR: https://github.com/zabbix-community/helm-zabbix/pull/69 Can you test it?
Describe the bug: As described here: https://www.zabbix.com/forum/zabbix-help/444411-advice-about-running-zabbix-server-inside-a-kubernetes-cluster
When the Zabbix server has multiple pods and the Zabbix agent works in active mode, the agent cannot communicate with the server normally.
Version of Helm and Kubernetes: Helm 4.0.2, Kubernetes v1.22.0. Zabbix server version: 6.4.10. Zabbix agent2 version: 6.4.9-1+ubuntu20.04.
What happened: zabbix-agent2 log:
Data is lost if zabbix-server works in HA mode (after working normally for about 1 hour, data of all nodes including the Zabbix server is lost): ![image](https://github.com/zabbix-community/helm-zabbix/assets/17961436/5d673c8f-c6a1-4603-8dda-ca4f3f3c101a)
What you expected to happen: zabbix-agent2 should communicate with the Zabbix server properly when the server has multiple pods. There should be no connection errors in the zabbix-agent2 log, and no monitoring data should be lost.
How to reproduce it (as minimally and precisely as possible): Set up the Zabbix server on Kubernetes with 2 pods, and add a zabbix-agent2 node working in active mode.
Anything else we need to know: I modified some files in the Helm chart, but this should not be the cause of this problem.