zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Wrong ownership error on /home/postgres/pgdata/pgroot/data while using with Amazon EFS #2378

Open: manoj016 opened this issue 1 year ago

manoj016 commented 1 year ago

Config:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  name: timescaledb
  namespace: default
spec:
  databases:
    tsdb: zalando
  numberOfInstances: 3
  patroni:
    pg_hba:
      - host    all             all          0.0.0.0/0          md5
      - local   all             all                                   trust
      - hostssl all             +zalandos    127.0.0.1/32       pam
      - hostssl all             +zalandos    ::1/128            pam
      - local   replication     standby                    trust
      - hostssl replication     standby all                md5
      - hostssl all             +zalandos    all                pam
      - host    all             all                127.0.0.1/32       md5
      - host    all             all                ::1/128            md5
      - hostssl all             all                all                md5
  postgresql:
    parameters:
      max_connections: '4000'
      shared_buffers: 256MB
    version: '14'
  resources:
    limits:
      cpu: 4000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 1Gi
  teamId: acid
  users:
    zalando:
      - superuser
      - createdb
  preparedDatabases:
    zalando:
      extensions:
        timescaledb: public
      schemas:
        public:
          defaultRoles: false
  volume:
    size: 200Gi
    storageClass: efs-cs

I am using Amazon EFS for the volume mount (storage class efs-cs).

I get the following error on every replica:

creating directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2023-07-19 10:39:27.157 UTC [426] FATAL:  data directory "/home/postgres/pgdata/pgroot/data" has wrong ownership
2023-07-19 10:39:27.157 UTC [426] HINT:  The server must be started by the user that owns the data directory.
child process exited with exit code 1
initdb: removing data directory "/home/postgres/pgdata/pgroot/data"
pg_ctl: database system initialization failed
2023-07-19 10:39:27,322 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 143, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 135, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 105, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 108, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1537, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1399, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1291, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1284, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: exceeded maximum number of restarts 5
stopping /etc/runit/runsvdir/default/patroni
timeout: finish: .: (pid 427) 8s, want down

I set permissions 700 and 750 on /home/postgres/pgdata, but it still does not work.

With the local provisioner storage class instead of Amazon EFS, however, everything works fine.
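The FATAL line in the log comes from PostgreSQL's startup check on the data directory (checkDataDir() in the server source). That check compares the directory's owner uid with the server's effective uid before it looks at the permission bits at all, which is why changing the mode to 700 or 750 cannot clear a "wrong ownership" error. The Python sketch below only approximates that logic for diagnosis; the function name check_data_dir and its messages are hypothetical and not part of PostgreSQL, Patroni or Spilo. Because initdb removes pgroot/data after the failure, running it in the pod as the postgres user against the mount root (/home/postgres/pgdata) is the practical way to see which owner uid the volume reports.

```python
import os
import stat
import sys
from typing import Optional

# Approximation of PostgreSQL's checkDataDir() permission rule (hypothetical
# re-implementation): group write and any "other" bits are rejected, so
# 0700 and 0750 pass while 0770 or 0755 do not.
PG_MODE_MASK_GROUP = stat.S_IWGRP | stat.S_IRWXO


def check_data_dir(path: str, euid: Optional[int] = None) -> Optional[str]:
    """Return an error string modelled on the server's, or None if the directory passes."""
    st = os.stat(path)
    if euid is None:
        euid = os.geteuid()  # the uid the server would run as
    # 1) Ownership is checked first; no chmod can satisfy this test.
    if st.st_uid != euid:
        return (f'data directory "{path}" has wrong ownership '
                f"(owner uid {st.st_uid}, effective uid {euid})")
    # 2) Only then are the permission bits checked.
    if st.st_mode & PG_MODE_MASK_GROUP:
        mode = stat.S_IMODE(st.st_mode)
        return f'data directory "{path}" has invalid permissions ({mode:04o})'
    return None


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 check_data_dir.py /home/postgres/pgdata
    print(check_data_dir(sys.argv[1]) or "ok")
```

If this reports an owner uid different from the effective uid on EFS but not on the local provisioner, the storage layer, rather than the directory mode, is what needs changing.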

andrewstuart commented 1 year ago

Have you tried setting spiloFSGroup: 103 in your spec? I'm on Ceph rather than EFS, and IIRC I needed this to get things working.

Also, isn't EFS essentially NFS? For something like a database I'd personally stick with block storage rather than a network filesystem.
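spiloFSGroup is a field of the postgresql manifest's spec; the operator uses it as the pod's fsGroup, overriding the operator-level spilo_fsgroup setting. A minimal Python sketch, assuming you want to apply the suggestion as a JSON merge patch; the helper names with_spilo_fsgroup and merge_patch are hypothetical and only show where the field belongs.

```python
import copy
import json


def with_spilo_fsgroup(manifest: dict, gid: int = 103) -> dict:
    """Return a copy of a postgresql manifest with spec.spiloFSGroup set to gid."""
    patched = copy.deepcopy(manifest)  # leave the caller's dict untouched
    patched.setdefault("spec", {})["spiloFSGroup"] = gid
    return patched


def merge_patch(gid: int = 103) -> str:
    """JSON merge-patch body that sets only spec.spiloFSGroup."""
    return json.dumps({"spec": {"spiloFSGroup": gid}})


if __name__ == "__main__":
    print(merge_patch())  # {"spec": {"spiloFSGroup": 103}}
```

The printed body can be passed to kubectl patch postgresql timescaledb -n default --type merge -p '<body>'. Since this manifest carries the app.kubernetes.io/managed-by: Helm label, a manual patch would likely be reverted by the next helm upgrade, so adding the field to the Helm-rendered manifest is the more durable option. Note also that fsGroup governs group ownership of the volume, while PostgreSQL compares the directory's owner uid with the server's effective uid, so whether this setting alone resolves the EFS case is not confirmed in this thread.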