zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

seteuid: Operation not permitted but cluster is up and running fine with PG 1.6 #562

Closed. neelasha-09 closed this issue 3 years ago

neelasha-09 commented 3 years ago

Hi Team,

We are using the new PG version 1.6. Below are the images used for the operator (OPR) and the cluster in our OCP environment.

OPR: v1.6.1-23-g2efa8312
Cluster: spilo-13:2.0-p5

The cluster is up and running fine, but the cluster logs still show the error "seteuid: Operation not permitted". Logs are attached.

Logs.txt

Permissions inside cluster:

postgres@postgres-operator-cluster-1-0:~$ id
uid=1000600000(postgres) gid=0(root) groups=0(root),1000600000

Could you please support?

CyberDem0n commented 3 years ago

A list of the processes in the container might help to figure it out. Please run ps auxwf and copy the output here.

neelasha-09 commented 3 years ago

Please find the output requested.

postgres@postgres-operator-cluster-1-0:~$ ps auxwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres    1390  0.3  0.0   4640   840 pts/0    Ss   07:30   0:00 /bin/sh -c TERM="xterm" /bin/sh
postgres    1397  0.0  0.0   4640   840 pts/0    S    07:30   0:00  \_ /bin/sh
postgres    1399  0.0  0.0  22044  4164 pts/0    S    07:30   0:00      \_ bash
postgres    1417  0.0  0.0  37812  3336 pts/0    R+   07:30   0:00          \_ ps auxwf
postgres       1  0.0  0.0   4396   824 ?        Ss   07:12   0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
postgres      10  0.0  0.0   4640  1752 ?        S    07:12   0:00 /bin/sh /launch.sh
postgres      32  0.0  0.0   4564   740 ?        S    07:12   0:00  \_ /usr/bin/runsvdir -P /etc/service
postgres      33  0.1  0.0   4412  1288 ?        Ss   07:12   0:01      \_ runsv cron
postgres      34  0.0  0.0   4412   800 ?        Ss   07:12   0:00      \_ runsv pgqd
postgres      38  0.0  0.0 108012  8136 ?        S    07:12   0:00      |   \_ /usr/bin/pgqd /home/postgres/pgq_ticker.ini
postgres      35  0.0  0.0   4412   856 ?        Ss   07:12   0:00      \_ runsv patroni
postgres      37  0.2  0.1 620216 38664 ?        Sl   07:12   0:02          \_ /usr/bin/python3 /usr/local/bin/patroni /home/postgres/postgres.yml
postgres      73  0.0  0.0 320592 30636 ?        S    07:13   0:00 /usr/lib/postgresql/13/bin/postgres -D /home/postgres/pgdata/pgroot/data --config-file=/home/postgres/pgdata/pgroot/data/p
postgres      75  0.0  0.0 200256  4676 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: logger
postgres      78  0.3  0.0 422180 25376 ?        Ssl  07:13   0:04  \_ postgres: postgres-operator-cluster-1: bg_mon
postgres      83  0.0  0.0 320688 15888 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: checkpointer
postgres      84  0.0  0.0 320576  6748 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: background writer
postgres      85  0.1  0.0 202712  5468 ?        Ss   07:13   0:01  \_ postgres: postgres-operator-cluster-1: stats collector
postgres      87  0.0  0.0 321792 16980 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: postgres postgres [local] idle
postgres     103  0.0  0.0 320576  9016 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: walwriter
postgres     104  0.0  0.0 321264  8680 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: autovacuum launcher
postgres     105  0.0  0.0 202456  4720 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: archiver last was 000000270000000000000028.partial
postgres     106  0.0  0.0 321632 14500 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: pg_cron launcher
postgres     107  0.0  0.0 321116  8700 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: TimescaleDB Background Worker Launcher
postgres     108  0.0  0.0 321116  7196 ?        Ss   07:13   0:00  \_ postgres: postgres-operator-cluster-1: logical replication launcher

CyberDem0n commented 3 years ago

It fails to start /usr/sbin/cron (in the listing above, runsv cron has no child process), therefore backups are effectively broken.

neelasha-09 commented 3 years ago

What is the solution?

CyberDem0n commented 3 years ago

It seems that OCP starts the container with nosuid, so cron can't be started as root, and unfortunately it can't work as a non-root user. The solution would be to find an alternative to cron that doesn't require root to work.
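
For anyone who wants to check the nosuid part on their own pod, here is a minimal sketch. It assumes a Linux container where /proc/mounts is readable, and it only reports which mount points carry the nosuid option; whether that is what stops cron in a given setup is an assumption to verify, not something this thread confirms. The function name nosuid_mount_points is made up for illustration.

```python
from typing import List


def nosuid_mount_points(mounts_text: str) -> List[str]:
    """Return mount points whose option list contains 'nosuid'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "nosuid" in options:
            result.append(mount_point)
    return result
```

Inside the container this could be called as nosuid_mount_points(open("/proc/mounts").read()). If the filesystem holding /usr/sbin/cron shows up in the result, set-user-ID bits on binaries there are ignored by the kernel.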

neelasha-09 commented 3 years ago

What is the impact of missing these backups if HA and minor version upgrades are working fine?

Samusername commented 3 years ago

Would running in privileged mode have an effect? (We would prefer not to run in that mode, but would it make a difference?)

CyberDem0n commented 3 years ago

What is the impact of missing these backups?

Well, I don't really know how to answer such questions... RAID is not a backup. HA is not a backup. Running replicas are not replacing the backup. Backup stored on the same machine is not a backup. Backup stored in the same DC is not a good backup. The backup that was never tested is a Schrödinger backup. And so on...

neelasha-09 commented 3 years ago

It seems that the AllowPrivilegeEscalation parameter in OPR 1.6 is causing the problem.
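
One way to confirm this from inside the pod, as a rough sketch: in Kubernetes, a container securityContext with allowPrivilegeEscalation set to false makes the runtime set the kernel's no_new_privs flag, and with that flag set execve no longer grants privileges through set-user-ID bits. The flag is exposed as the NoNewPrivs field in /proc/<pid>/status on Linux 4.10 and later. That the AllowPrivilegeEscalation parameter mentioned here maps to that Kubernetes field is an assumption; the helper name parse_no_new_privs is hypothetical.

```python
from typing import Optional


def parse_no_new_privs(status_text: str) -> Optional[bool]:
    """Return the NoNewPrivs flag from /proc/<pid>/status text.

    Returns True if the flag is 1, False if it is 0, and None if the
    field is absent (for example on kernels older than 4.10).
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NoNewPrivs":
            return value.strip() == "1"
    return None
```

For the current process, parse_no_new_privs(open("/proc/self/status").read()) returning True would be consistent with privilege escalation being disallowed for the container.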

We identified that certificate rotation would also be affected by cron not running. We agree that in the long run we should ideally move away from the cron dependency.