zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Add fsgroupchangepolicy option #1850

Open xhejtman opened 2 years ago

xhejtman commented 2 years ago

Would it be possible to add an fsGroupChangePolicy option to the security context of the Postgres StatefulSet that the operator creates?

Sometimes the kubelet changes the access rights of the data directory so that the fsGroup GID can read and write it, which Postgres rejects as a security issue. Setting fsGroupChangePolicy: OnRootMismatch avoids this, because the recursive ownership and permission change is skipped when the root of the volume already matches. This does not seem to be possible at the moment, so adding it as an option would be appreciated.
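
To illustrate why the recursive change upsets Postgres, here is a small sketch of the mode arithmetic. It assumes a simplified model of the kubelet: for a writable volume it ORs owner/group read-write onto every entry and additionally sets the setgid bit and owner/group execute on directories. The function names are hypothetical, and the acceptance check is modelled on the Postgres error message (0700 or 0750), not on the actual Postgres source.

```python
import stat

# Assumed, simplified kubelet behaviour for a writable volume with fsGroup set.
RW_MASK = 0o660    # owner and group read/write, applied to every entry
EXEC_MASK = 0o110  # owner and group execute, applied to directories only


def mode_after_fsgroup(mode: int, is_dir: bool = True) -> int:
    """Return the mode an entry would have after the recursive change."""
    new_mode = mode | RW_MASK
    if is_dir:
        new_mode |= stat.S_ISGID | EXEC_MASK
    return new_mode


def postgres_accepts(mode: int) -> bool:
    """Simplified check: permission bits must be 0700 or 0750."""
    return (mode & 0o777) in (0o700, 0o750)


before = 0o700
after = mode_after_fsgroup(before)
print(oct(before), postgres_accepts(before))
print(oct(after), postgres_accepts(after))
```

Under these assumptions, a data directory that starts at 0700 gains group write access, which is exactly what the "invalid permissions" check refuses.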

stephan2012 commented 1 year ago

We occasionally face invalid permissions issues:

2022-09-12 07:09:55,610 INFO: doing crash recovery in a single user mode
2022-09-12 07:09:55,631 ERROR: Crash recovery finished with code=1
2022-09-12 07:09:55,631 INFO:  stdout=
2022-09-12 07:09:55,631 INFO:  stderr=2022-09-12 07:09:55 UTC [31680]: [1-1] 631edb43.7bc0 0     FATAL:  data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2022-09-12 07:09:55 UTC [31680]: [2-1] 631edb43.7bc0 0     DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

2022-09-12 07:09:56.363 36 LOG {ticks: 0, maint: 0, retry: 0}

We do not currently know how to reproduce the error. However, a GitHub issue for the Crunchy Postgres Operator suggests that setting fsGroupChangePolicy to OnRootMismatch fixes it. Unfortunately, the Zalando Postgres Operator seems to offer no way to adjust the security context configuration.

stephan2012 commented 1 year ago

Looks like we were hit by issue #1703: I/O performance issues caused the Kubernetes control plane to restart, triggering this issue. In terms of resilience, it would be helpful to configure fsGroupChangePolicy for the database StatefulSet.
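
For reference, this is roughly what the requested setting would look like in the pod template of the database StatefulSet, sketched as a Python dictionary that serialises to the equivalent JSON patch. `fsGroup` and `fsGroupChangePolicy` are standard fields of the Kubernetes pod security context; the helper names and the GID 103 are assumptions for illustration only, and the operator may overwrite manual changes to a StatefulSet it manages.

```python
import json

VALID_POLICIES = ("OnRootMismatch", "Always")


def pod_security_context(fs_group: int, policy: str = "OnRootMismatch") -> dict:
    """Build the pod-level securityContext fragment (hypothetical helper)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown fsGroupChangePolicy: {policy!r}")
    return {"fsGroup": fs_group, "fsGroupChangePolicy": policy}


def statefulset_patch(fs_group: int, policy: str = "OnRootMismatch") -> dict:
    """Wrap the fragment in the path it occupies inside a StatefulSet."""
    return {
        "spec": {
            "template": {
                "spec": {"securityContext": pod_security_context(fs_group, policy)}
            }
        }
    }


# 103 is an assumed GID, used here only as a placeholder.
print(json.dumps(statefulset_patch(103), indent=2))
```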

stephan2012 commented 1 year ago

/assign

Lima118 commented 7 months ago

@stephan2012 How do you get around this issue when it comes up? I had to recover my Postgres cluster, and when I did, I got the permissions error mentioned above. This did not happen the last time I had to recover the cluster. Now I can't get it to start, and restarting doesn't seem to help.

stephan2012 commented 7 months ago

You can manually fix the directory permissions by shelling into the Pod and running chmod. Look at the container logs to see what Patroni expects. Unfortunately, my PR is still waiting for a response from the maintainers.
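
As a rough sketch of that manual fix, the following checks a directory's permission bits and resets them only when they are not one of the two modes named in the error message. It assumes a Python interpreter is available inside the container; the function name is hypothetical, and in practice a plain chmod does the same job.

```python
import os

ACCEPTED_MODES = (0o700, 0o750)  # taken from the Postgres error message


def repair_data_dir_mode(path: str, target: int = 0o700) -> bool:
    """Reset the directory mode if needed; return True when it was changed."""
    current = os.stat(path).st_mode & 0o777
    if current in ACCEPTED_MODES:
        return False
    os.chmod(path, target)
    return True
```

It could be called with the path from the log output, for example `repair_data_dir_mode("/home/postgres/pgdata/pgroot/data")`, run as a user that owns the directory.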

Lima118 commented 7 months ago

It was tricky for me because the data directory was renamed after the start failed. I had to catch the exact moment when the data directory was created and change its permissions very quickly; otherwise the start failed, the directory was renamed, and the bootstrap began again. But thank you very much, this was the solution, just a bit trickier in my case.