We're seeing repeated segmentation faults in the bg_mon module when deploying Postgres 16 via postgres-operator into Kubernetes clusters whose nodes run RHEL 9.4, which leaves Postgres stuck in a recovery loop.
The issue occurred when upgrading our Postgres clusters from Postgres 15 (docker image ghcr.io/zalando/spilo-15:3.0-p1) to 16, and it also occurs with fresh installs.
Confirmed the issue does not occur on RHEL 9.4 with ghcr.io/zalando/spilo-15:3.0-p1, and it also does not seem to occur on RHEL 9.3 or 8.10.
Environment
ghcr.io/zalando/postgres-operator:v1.12.0
Kubernetes, seen on kURL and Rancher clusters with nodes running RHEL 9.4. Tested with RHEL 9.4 AMIs on AWS, and also seen in several customer installs on the distributions mentioned. I assume we'll see the same thing on other distros as well.
ghcr.io/zalando/spilo-16:3.2-p3; also tested ghcr.io/zalando/spilo-16:3.3-p2 with the same result.
Velero is running in the clusters but is not performing automatic backups. My suspicion was that it was somehow causing the issue, since the segfaults seem to happen less often after removing the Velero pod annotations, but they still occur occasionally.
Not really sure how to resolve this. I've tried to isolate the configuration that triggers it: it always occurs on initial installs of our environments, but only most of the time when deploying multiple Postgres clusters in the same environment or a cluster with a cut-down configuration. So far I haven't spotted a single specific trigger, and the segfault doesn't appear at a consistent point in the logs that would indicate a particular operation Postgres was running set it off.
Is bg_mon an essential module for spilo, and is there any impact from removing it from the shared_preload_libraries list?
2024-09-27 12:11:16.137 UTC,,,72,,66f6a0b9.48,7,,2024-09-27 12:10:33 UTC,,0,LOG,00000,"background worker ""bg_mon"" (PID 78) was terminated by signal 11: Segmentation fault",,,,,,,,,"","postmaster",,0
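For context, this is roughly how I'd try dropping bg_mon myself, by overriding shared_preload_libraries through the CR's spec.postgresql.parameters. It's only a sketch under assumptions: the cluster name, teamId, sizing and the remaining library list are placeholders, and I'm not certain whether spilo's own generated shared_preload_libraries value takes precedence over a value set here.

```yaml
# Sketch only: try to preload everything spilo normally loads except bg_mon.
# Cluster name, teamId, sizes and the library list are placeholders; check
# "SHOW shared_preload_libraries;" on a running instance for the real default.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-example-cluster        # placeholder
spec:
  teamId: "acid"                    # placeholder
  numberOfInstances: 2
  volume:
    size: 10Gi
  postgresql:
    version: "16"
    parameters:
      # spilo's usual list minus bg_mon (illustrative, not verified)
      shared_preload_libraries: "pg_stat_statements,pgextwlist,pg_auth_mon,set_user"
```

If spilo re-appends bg_mon regardless of this parameter, I assume a patched image would be needed instead, which is why I'm asking whether dropping it is safe at all.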
The bg_mon module in question: https://github.com/CyberDem0n/bg_mon
I've added a gist with the coredump stacktrace, our postgres-operator postgresql CR, and some sample logs from when the segfault first appears in pg_log: https://gist.github.com/renedamyon/6130ad4dd65edfbaeae6a43717f3adc2
Thanks for any assistance, Rene