zabbix-community / helm-zabbix

Helm chart for Zabbix
https://artifacthub.io/packages/helm/zabbix-community/zabbix
Apache License 2.0

upgrade issue #42

Open hooray4me opened 1 year ago

hooray4me commented 1 year ago

Describe the bug: Unable to upgrade from 6.0.13 to 6.4. I can't find a way to comment out HANodeName so that the server can start up in standalone mode.

Version of Helm and Kubernetes: 1.27

Any suggestions?

aeciopires commented 1 year ago

Hi @hooray4me!

Sorry for the late reply.

Today I published versions 4.0.0 and 4.0.1 of the chart, which contain some important changes. I recommend that you read up on them and test.

The HA mode of the Zabbix Server can be disabled with the following values:

zabbixServer:
  enabled: true
  replicaCount: 1

HA mode only works with two or more Zabbix Server replicas.

IlyaPupkovs commented 11 months ago

Today I tried to upgrade Zabbix from 6.0.9 to 6.4.7, and even with

zabbixServer:
  enabled: true
  replicaCount: 1

it still starts in HA mode:

8:20231026:124626.896 current database version (mandatory/optional): 06000000/06000043
8:20231026:124626.896 required mandatory version: 06040000
8:20231026:124626.896 mandatory patches were found
8:20231026:124626.906 cannot perform database upgrade in HA mode: all nodes need to be stopped and Zabbix server started in standalone mode for the time of upgrade.

The Zabbix upgrade documentation says: "[...] change its configuration to standalone mode by commenting out HANodeName [parameter]". So I tried to add

    - name: "ZBX_HANODENAME"
      value:

to zabbixServer.extraEnv, but the deployment ignores it:

** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherAll13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherCert": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherCert13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherPSK": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherPSK13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSKeyFile": 'privatekey'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSPSKIdentity": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSPSKFile": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "ServiceManagerSyncFrequency": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "HANodeName": 'zabbix-services-zabbix-server-ddff74775-rhl4z'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "NodeAddress": '10.66.34.204'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "User": 'zabbix'...added

Changing Docker images gave no result, so I assume Helm somehow defines ZBX_HANODENAME=hostname.

P.S. Removing ZBX_HANODENAME or setting it to null didn't have any effect.

IlyaPupkovs commented 11 months ago

So in the end I was able to upgrade from 6.0.9 to 6.4.8, using Helm chart version 4.0.2 and the image alpine-6.4-latest.

The problem was the undocumented parameter ZBX_AUTOHANODENAME, which is hardcoded into the chart, is always present on the pod, and is responsible for starting the server in HA mode.

Interestingly enough, I could set

- name: ZBX_AUTOHANODENAME
  value: ""

only without any other parameters in zabbixServer.extraEnv. If any other parameter (in this case ZBX_HANODENAME) was present, it resulted in an error; note that ZBX_AUTOHANODENAME appears twice in the $setElementOrder list below, because the chart already defines it, and the duplicated key breaks Kubernetes' strategic merge patch:

client.go:428: [debug] error updating the resource "zabbix-zabbix-server":
         cannot patch "zabbix-zabbix-server" with kind Deployment: The order in patch list:
[map[name:ZBX_AUTOHANODENAME value:hostname] map[name:ZBX_AUTOHANODENAME value:] map[name:ZBX_HANODENAME value:]]
 doesn't match $setElementOrder list:
[map[name:DB_SERVER_HOST] map[name:DB_SERVER_PORT] map[name:POSTGRES_USER] map[name:POSTGRES_PASSWORD] map[name:POSTGRES_DB] map[name:ZBX_AUTOHANODENAME] map[name:ZBX_HANODENAME] map[name:ZBX_AUTOHANODENAME] map[name:ZBX_NODEADDRESS] map[name:ZBX_WEBSERVICEURL] map[name:ZBX_STARTREPORTWRITERS]]

So I did two deployment cycles: one with no extra parameters except ZBX_AUTOHANODENAME, and then, after the DB was upgraded, a second cycle with all the usual parameters and without ZBX_AUTOHANODENAME.
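
Expressed as values overrides, the two cycles could look roughly like this (a sketch of the procedure above; apart from ZBX_AUTOHANODENAME and replicaCount, the settings shown are illustrative):

# cycle 1: start a single standalone server so it can upgrade the database
zabbixServer:
  enabled: true
  replicaCount: 1
  extraEnv:
    - name: ZBX_AUTOHANODENAME
      value: ""

# cycle 2: once the database upgrade has completed, redeploy with the
# usual values, dropping ZBX_AUTOHANODENAME from extraEnv again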

fibbs commented 11 months ago

It is by design that the Zabbix server ALWAYS starts in HA mode, even with replicas set to 1. This is to make sure that a scale-up just works, and it had, at least back when I developed that part, no negative effect beyond being an "HA cluster with just one node". The issue with upgrading across major versions is not entirely solved yet. The best workaround, if I understand your post correctly, would be to scale down to just one replica, do the upgrade, and then scale up again. Or am I getting something completely wrong?

fibbs commented 11 months ago

Sorry, I did not read carefully. So the problem is that, apparently, Zabbix server recently stopped accepting a database upgrade when run in HA mode. This is actually new to me. Let me think about how to solve this in the most elegant way... First idea: we already have a job that runs in single mode and prepares the database before the "real" Zabbix server pods start up, designed to create the database structure for a fresh installation. I am thinking of a similar solution for upgrading.

IlyaPupkovs commented 11 months ago

Yep, exactly: Zabbix server does not accept upgrading the database when run in HA mode. As for the solution, it sounds great if it could be implemented that way.

fibbs commented 10 months ago

I am in the incubating phase of finding a solution :)

szelga commented 8 months ago

For me, setting ZBX_AUTOHANODENAME to "" (without specifying ZBX_HANODENAME in values.yaml in any way whatsoever) during the upgrade did the trick. I didn't touch the other extra env variables (I use TimescaleDB, so I can't do without them).

UPD: and setting replicaCount to 1 during the upgrade, of course.

fibbs commented 3 months ago

An upgrade from 6 to 7 unfortunately fails (well, it doesn't actually fail, but it doesn't complete entirely) when using TimescaleDB, because timescaledb.sql must be executed once again to create the newly needed hypertable:

229:20240619:083705.201 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR:  table "auditlog" is not a hypertable

I am wondering whether the best way to solve this once and for all is to create a post-install and post-upgrade hook Job that handles all the database-schema-related tasks. Up to now we have one Job, simply deployed with the chart, that only takes care of initializing the database. The good thing was that no custom image was needed for that, just a bit of sed magic. I think this has to be redesigned entirely, also with future use cases in mind, with ONE custom image taking care of all of these tasks.

It should be built as a custom image, or at least use an entrypoint script mounted as a ConfigMap or similar, but the image should be based on the Zabbix server image (which is needed for the actual upgrade of the DB schema).

From my point of view, this should also fix the problem found above when the Zabbix server is running in HA mode.

Any more comments on this? I will investigate further over the next days.
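
To illustrate the direction, a post-install/post-upgrade hook Job could be wired up roughly as follows. This is only a sketch of the idea, not the chart's actual implementation; the Job name and image tag are placeholders, and the entrypoint logic is left out (see the STDOUT-watcher sketch further below):

apiVersion: batch/v1
kind: Job
metadata:
  name: zabbix-db-schema   # placeholder name
  annotations:
    # run on every install and upgrade
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: db-schema
          # based on the Zabbix server image, which carries the code
          # needed for the actual DB schema upgrade
          image: zabbix/zabbix-server-pgsql:alpine-6.4-latest  # placeholder tag
          # entrypoint: initialize or upgrade the schema (incl. the
          # TimescaleDB setup), then exit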

crowleym commented 3 months ago

@fibbs I was able to upgrade from Zabbix 6.5 to 7 with the following steps.

  1. Edit values to scale down the Zabbix server to replicaCount: 0 and deploy using zabbix-community/zabbix.
  2. Clone the Helm chart source and comment out the ZBX_AUTOHANODENAME config (name and value) in https://github.com/zabbix-community/helm-zabbix/blob/master/charts/zabbix/templates/deployment-zabbix-server.yaml#L142
  3. Deploy from this local clone with replicaCount: 1.
  4. Follow the container logs until the DB upgrade is complete.
  5. Log in and test.

Now, when scaling the server back to the original replicaCount value of 3, I get the following error:

Error: UPGRADE FAILED: error validating "": error validating data: ValidationError(Job.spec): unknown field "metadata" in io.k8s.api.batch.v1.JobSpec

The same error occurs when deploying from zabbix-community/zabbix or from the local clone.

Looking into it, I see an if statement that affects how things are deployed depending on the replicaCount, so I will try to understand this better: https://github.com/zabbix-community/helm-zabbix/blob/master/charts/zabbix/templates/job-init-db-schema.yaml#L1

Once understood, I am guessing a PR with the same kind of if statement could disable HA automatically for replicaCount: 1, so that ZBX_AUTOHANODENAME is not applied.
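
Such a guard in the deployment template might look roughly like this (a sketch, not the chart's current code; "hostname" is the value the chart injects today, as seen in the patch error above):

{{- if gt (int .Values.zabbixServer.replicaCount) 1 }}
# only auto-derive an HA node name when more than one replica is requested
- name: ZBX_AUTOHANODENAME
  value: "hostname"
{{- end }}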

crowleym commented 3 months ago

When the server is started in single mode, it automatically upgrades the DB itself, so I am questioning the need for the job at all, if Zabbix has changed its behaviour as mentioned in a comment above.

By adding a false condition to the top of the job template, as well as only applying ZBX_AUTOHANODENAME if replicaCount is greater than 1, I was able to use the chart to upgrade from 6.5 to 7.

I have opened PR #102 in case it helps someone else, but I cannot comment on the validity of removing the Job entirely beyond "it worked for me".
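
For reference, the "false condition" part of that change amounts to never rendering the template (a sketch of the workaround described above, not necessarily identical to PR #102):

{{- if false }}
# ...original contents of job-init-db-schema.yaml, no longer rendered...
{{- end }}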

fibbs commented 3 months ago

Thanks @crowleym, that is indeed exactly the way I did upgrades, but it is a bit "hacky" and shouldn't stay this way, which is why I am working on a proper solution. I don't want this Helm chart to run the Zabbix server in "single mode", even when there is only one replica. We decided that back when DB upgrades still worked in HA mode, because we wanted to be able to scale up and down at any time.

I have an almost-working solution here in my lab, with one or two challenges left to solve. One of them is starting a zabbix_server process that only upgrades the database schema and then stops, which I will try to achieve with a hacky "start the process in the background and loop reading its STDOUT" kind of construct. Briefly, the solution will work as follows:

This is almost exactly the same as it is designed to work right now, with the following changes:

That should then work fine and without manual intervention.
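
As an illustration of that construct, the Job container's command could do something like the following. This is only a sketch; zabbix_server's -f/--foreground flag is real, but the exact log line to wait for is an assumption and would need to be checked against the real server output:

command:
  - /bin/sh
  - -c
  - |
    # run zabbix_server in foreground mode, backgrounded in this shell,
    # and capture its output to a file we can watch
    /usr/sbin/zabbix_server --foreground > /tmp/zbx.log 2>&1 &
    pid=$!
    # loop over its output until the schema upgrade is reported as finished
    # (message text is an assumption; adjust to the actual log line)
    until grep -q "database upgrade fully completed" /tmp/zbx.log; do
      sleep 5
    done
    # stop the temporary server again; the regular server pods start afterwards
    kill "$pid"
    wait "$pid" || true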

Of course, it would be awesome if Zabbix themselves implemented a zabbix_server --only-upgrade-db switch or something similar, so that this Job container could be less hacky. I will probably try to get into a discussion with the "right people" and convince them to make our lives easier.

Stay tuned, an upgrade will come.

spectroman commented 2 months ago

Hi @fibbs, I wonder whether you managed to raise the issue with Zabbix SIA; is there a support ticket we could upvote?

I am facing the same problem here. Although I don't use this Helm project (I have my own methodology with different specs), I got stuck on the same issue.

I went looking to see whether someone had found a solution, and I found this ticket and something related on the Zabbix forums, but to no avail.

I came up with some ideas, but by far the best solution would be an --only-upgrade-db kind of switch provided by them.

As I compile my own binaries and build my own images, I was thinking that I could snoop around the source code, catch the latest DBPATCH_VERSION (an integer) and expose it to the entrypoint, check against the database whether it requires an update, apply the necessary changes, stop zabbix_server when that is finished, add back the HA configuration, and restart the pod...

But this is so ugly that I am not really happy pursuing it, so maybe I will patch the Zabbix source code myself when building the image, if it looks like Zabbix SIA will take a long time to release a solution.

In the end, I would also find it beneficial to add a new HA node status that informs the other nodes that the database is being upgraded; they would simply back off until the node executing the upgrade marks it as finished and/or assumes the active role. That would fix the problem without needing an "only-upgrade-db" switch, but would require a larger patch.

If possible, I would be glad to know the status of the conversation with Zabbix SIA and about the ticket... then I can also decide whether to go forward with writing/using a patched zabbix_server binary.