zabbix-community / helm-zabbix

Helm chart for Zabbix
https://artifacthub.io/packages/helm/zabbix-community/zabbix
Apache License 2.0

[zabbix-community/zabbix] unstable connection to the zabbix-server service when in HA mode #98

Open szelga opened 3 months ago

szelga commented 3 months ago

Random (i.e., sometimes the connection succeeds, sometimes it fails) connection timeouts to port 10051 on the zabbix-server service.

Version of Helm and Kubernetes:

kubernetes version: v1.22.8

helm version: using ArgoCD 2.10 for the Zabbix deployment, so Helm is v3. If the exact Helm version is relevant for this issue, I will dig deeper.

What happened:

Zabbix agents are unable to connect to the server for heartbeat messages or active checks most of the time:

2024/06/20 13:29:20.039885 [101] active check configuration update from host [zabbix-agent-host] started to fail
2024/06/20 13:30:13.989139 [101] cannot connect to [zabbix.company.name:10051]: dial tcp :0->13.13.13.13:10051: i/o timeout
2024/06/20 13:30:13.989185 [101] sending of heartbeat message for [zabbix-agent-host] started to fail

Incidentally, telnet zabbix-zabbix-server.monitoring 10051 also randomly succeeds or fails, so the problem is not in the ingress.

What you expected to happen:

The connection should succeed every time.

How to reproduce it:

I think this can be reliably reproduced on any HA installation. Below I post the relevant values.yaml entries (since it might just as well be a misconfiguration on my part); feel free to request any additional details.

Anything else we need to know:

If I try to connect to the individual zabbix-server pods on port 10051, only 1 (of 3) can be connected to (and that one connects reliably). Presumably the Service is unaware of which pod is currently active and just routes to a random backend.
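For reference, this can be double-checked per pod with something like the following (namespace and label selector are assumptions that depend on the chart's labels and the release name):

# List the server pod IPs (adjust namespace and label selector to your release)
kubectl -n monitoring get pods -l app=zabbix-zabbix-server -o wide

# Probe port 10051 on each pod IP; only the active HA node should accept the connection
nc -vz -w 2 <pod-ip> 10051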

relevant values:

zabbixServer:
  extraEnv:
    - name: ZBX_AUTOHANODENAME
      value: "fqdn"
  haNodesAutoClean:
    image:
      repository: timescale/timescaledb
      tag: latest-pg16
  replicaCount: 3
  service:
    type: ClusterIP

The other (potentially relevant) values are unchanged.

When not in HA mode, it works 100% of the time.

fibbs commented 3 months ago

I can confirm this is a current problem which I have experienced as well and am trying to find a solution for.

The problem is that Zabbix's HA mode works in such a way that only one Zabbix server instance is "active". The active instance starts all internal processes and writes to a database table at short intervals to signal that it is still there and active. The other server instances only connect to the database, see that another one is "master" and therefore do NOT start any of the internal processes, such as pollers. One of these processes is the "trapper", which opens and manages port 10051 and is needed for active checks, active proxies, etc.
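As a side note, the table in question is ha_node (Zabbix 6.0+), so one way to see which node is currently active is to query it directly. A rough sketch, assuming a PostgreSQL backend; the pod name, database and user are placeholders, and the column list is worth verifying against your schema version:

# status column: 0 = standby, 1 = stopped, 2 = unavailable, 3 = active
kubectl -n monitoring exec -it <postgres-pod> -- \
  psql -U zabbix -d zabbix -c "SELECT name, address, port, status, lastaccess FROM ha_node;"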

Long story short: the currently employed solution is not good, as a Kubernetes Service can only round-robin amongst the backends it has.

Possible approaches that I have in mind:

  1. Employ a load balancer (HAProxy) with backend checking. This way, we could have ONE small load balancer (or even two instances of it, with a Kubernetes Service in front) that has all backends configured and checks whether the port actually responds. This would work, but it adds quite some complexity, as HAProxy would have to be reconfigured dynamically whenever new pods are spawned, the Zabbix server is scaled up or down, etc.

  2. Build a mini-controller that checks the role of the Zabbix server pods. A small controller pod that can "check" whether each Zabbix server pod is currently master or standby and sets appropriate labels on the pods. This way, the Zabbix server Kubernetes Service could use a modified label selector pointing only to the master pod, and it would work appropriately. The challenge is to determine a Zabbix server pod's role "from the outside" and to make the controller pod itself reliable and highly available, as the whole Zabbix HA setup would depend heavily on the labels being set correctly.

  3. Multiple services / ingresses (one per HA node). Zabbix's concept foresees that every active agent and active proxy connecting to a Zabbix server cluster in HA mode has all server endpoints configured (separated by ";" instead of "," to signal that it is an HA setup); the agents then figure out by themselves which node is up and active. If we modified the Helm chart to expose one Service per server instance, and one host name / LoadBalancer IP address per Zabbix server instance, we would not have to care about which Zabbix server pod is master and which is not (see the sketch after this list). This would be the most difficult option to implement in the Helm chart and in Kubernetes, as the communication protocol is not HTTPS, so there is no SNI and no Ingress controller can be used. I would really not like to depend on being able to expose several "real" IP addresses to the outside just for Zabbix server HA, and managing the mapping between pod and external IP address (type: LoadBalancer) would also be challenging to implement in a generic way.
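To make option 3 a bit more concrete, here is a rough sketch (purely illustrative, not something the chart currently renders; all names are hypothetical): if the server pods ran as a StatefulSet, one Service per pod could select it via the statefulset.kubernetes.io/pod-name label and expose it under its own address:

# Hypothetical per-node Service; selecting by pod name requires a StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: zabbix-server-ha-0
  namespace: monitoring
spec:
  type: LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: zabbix-zabbix-server-0
  ports:
    - name: zabbix-trapper
      port: 10051
      targetPort: 10051

Agents and proxies would then list all of these endpoints in ServerActive, separated by semicolons, e.g. ServerActive=zbx-ha-0.company.name:10051;zbx-ha-1.company.name:10051;zbx-ha-2.company.name:10051 (host names are placeholders).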

As of now, I prefer the "controller" way of implementing this and have started investigating it further. If you have better ideas or input, please let me know.

aeciopires commented 3 months ago

Hello guys!

I understand the root cause, but I think this is not a problem of the Helm chart. It is a problem of the Zabbix server application, because its HA implementation is not well suited to Kubernetes.

@fibbs I think your suggestions are sound, and an operator-pattern solution (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) is very similar to suggestion 2.

But I don't think it is worth the effort to work around an application problem with a trick in the Helm chart.

Looking at the Zabbix roadmap, this problem may be solved in version 7.2, planned for Q4/2024 (https://www.zabbix.com/roadmap).

What do you think?

[screenshot attached: 2024-06-21_05-18]

szelga commented 3 months ago

"I think this is not a problem of the Helm chart."

Makes sense.

Which solution would be the easiest to slap on top of the existing chart short-term, until upstream implements this properly?

aeciopires commented 3 months ago

Hello @szelga!

Unfortunately, I don't have the knowledge to implement any of the suggestions, but let's wait and see if @fibbs has time available. Also, feel free to open a PR if you know how to implement it.

leighwgordon commented 1 month ago

Just chiming in here: I am also looking at this exact scenario. I'm not sure whether it is within the scope of the Helm chart, but it seems like a common complaint, and the issue will be hit by anyone setting replicas to more than 1 on the server deployment, so I suspect it will continue to be a common complaint!

I recently found this, and I will experiment with a variation of it at some point:

https://faun.pub/active-passive-load-balancing-with-kubernetes-services-742cae1938af
https://github.com/psdally/k8s-active-passive/blob/main/k8s/base/loadbalancer-deployment.yaml

It's essentially option 2 from @fibbs' comment, implemented with a script.

The mechanism I will try first to determine which pod to label as active is to simply test TCP port 10051, as it is only open on the active server instance.
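For anyone who wants to experiment with that idea before anything lands in the chart, here is a rough sketch (all names, labels, images and the namespace are hypothetical and need to be adapted to the chart's actual labels; the probe image must provide kubectl and nc): a small CronJob probes every server pod on TCP/10051 and sets a zabbix-ha-role label accordingly, and a dedicated Service selects only the pod currently labelled active.

# Sketch of the "label the active pod" approach; RBAC (a ServiceAccount allowed to
# list and patch pods) is assumed and omitted here.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zabbix-active-labeler
  namespace: monitoring
spec:
  schedule: "* * * * *"   # re-check once per minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: zabbix-active-labeler
          restartPolicy: OnFailure
          containers:
            - name: labeler
              image: example/kubectl-with-nc:latest   # placeholder: any image providing kubectl + nc
              command:
                - /bin/sh
                - -c
                - |
                  # Probe each server pod on 10051; only the active HA node answers.
                  for pod in $(kubectl get pods -l app=zabbix-zabbix-server -o name); do
                    ip=$(kubectl get "$pod" -o jsonpath='{.status.podIP}')
                    if nc -z -w 2 "$ip" 10051; then
                      kubectl label --overwrite "$pod" zabbix-ha-role=active
                    else
                      kubectl label --overwrite "$pod" zabbix-ha-role=standby
                    fi
                  done
---
# A Service that only ever points at the pod currently labelled as active.
apiVersion: v1
kind: Service
metadata:
  name: zabbix-server-active
  namespace: monitoring
spec:
  selector:
    app: zabbix-zabbix-server
    zabbix-ha-role: active
  ports:
    - name: zabbix-trapper
      port: 10051
      targetPort: 10051

A long-running loop (like the Deployment in the linked k8s-active-passive example) would react faster than a once-a-minute CronJob; the CronJob above is just the simplest way to show the labelling step.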