zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Error on Openshift 3.11 in privileged mode #852

Closed dejwsz closed 4 years ago

dejwsz commented 4 years ago

I ran the operator in privileged mode just to check if it works - version "1.3.1". I used the same service account for the PODs originally created by the CSV and added the privileged scc to the account just to be sure all permissions are there. Finally, my test cluster was created, but the error below was shown in the logs and the cluster ended up in SyncFailed status.

    ...
    2020-03-02 14:23:44,915 INFO: Lock owner: test-minimal-cluster-0; I am test-minimal-cluster-0
    2020-03-02 14:23:44,915 INFO: establishing a new patroni connection to the postgres cluster
    2020-03-02 14:23:44,927 ERROR: Permission denied
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 61, in wrapper
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 282, in patch_or_create
        return self.retry(func, self._namespace, body) if retry else func(self._namespace, body)
      File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 114, in retry
        return self._retry.copy()(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/patroni/utils.py", line 313, in __call__
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 50, in wrapper
        return getattr(self._api, func)(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 15602, in patch_namespaced_endpoints
        (data) = self.patch_namespaced_endpoints_with_http_info(name, namespace, body, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 15698, in patch_namespaced_endpoints_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 335, in call_api
        _preload_content, _request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 148, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 409, in request
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 307, in PATCH
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 240, in request
        raise ApiException(http_resp=r)
    kubernetes.client.rest.ApiException: (403)
    Reason: Forbidden
    HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-store', 'Content-Type': 'application/json', 'Date': 'Mon, 02 Mar 2020 14:23:44 GMT', 'Content-Length': '267'})
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \"test-minimal-cluster\" is forbidden: endpoint address 10.128.3.34 is not allowed","reason":"Forbidden","details":{"name":"test-minimal-cluster","kind":"endpoints"},"code":403}
    2020-03-02 14:23:44,927 ERROR: failed to update leader lock
    2020-03-02 14:23:44,962 INFO: not promoting because failed to update leader lock in DCS
    2020-03-02 14:23:54,917 INFO: Lock owner: test-minimal-cluster-0; I am test-minimal-cluster-0
    2020-03-02 14:23:54,921 ERROR: Permission denied
    [the identical traceback and 403 "endpoint address 10.128.3.34 is not allowed" response repeat here]
    ...
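Note that this 403 is not a plain RBAC miss: the message "endpoint address 10.128.3.34 is not allowed" is produced by OpenShift 3.x's RestrictedEndpointsAdmission plugin, which rejects Endpoints whose addresses fall inside the cluster network unless the caller may create the endpoints/restricted subresource. A minimal rule sketch of that commonly cited workaround (my assumption, not something verified in this thread; it would go into the role bound to the Spilo pods' service account):

    # Sketch (assumption): let Patroni write endpoint addresses from the
    # cluster network past OpenShift's RestrictedEndpointsAdmission check.
    - apiGroups:
        - ""
      resources:
        - endpoints/restricted
      verbs:
        - create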

FxKu commented 4 years ago

Users have already reported the lack of Endpoints support in OpenShift.

dejwsz commented 4 years ago

Oops, OK, sorry, I will take a look there. Thanks.

dejwsz commented 4 years ago

I tried using "pod_environment_configmap" and setting PATRONI_KUBERNETES_USE_ENDPOINTS to false. No luck with this.

flickerfly commented 4 years ago

Read the next comment down from that and you'll see some further information like:

correct env name for spilo is currently KUBERNETES_USE_CONFIGMAPS
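For reference, that variable is injected through the operator's pod_environment_configmap mechanism; a minimal sketch (the name pod-env-cfg matches the configuration used later in this thread):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pod-env-cfg        # referenced by pod_environment_configmap
      namespace: olm
    data:
      # Tell Patroni/Spilo to store cluster state in ConfigMaps instead of Endpoints
      KUBERNETES_USE_CONFIGMAPS: "true"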

dejwsz commented 4 years ago

Indeed. But spilo-role is not assigned, so "pod_role_label" is not working. Any trick here?

dejwsz commented 4 years ago

I saw this warning in the log:

    2020-03-03 10:18:45,606 - bootstrapping - WARNING - could not parse kubernetes labels as a JSON: Expecting value: line 1 column 1 (char 0), reverting to the default: {"application": "spilo"}

and two CRIT messages:

    2020-03-03 10:18:46,711 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
    2020-03-03 10:18:46,721 CRIT Server 'unix_http_server' running without any HTTP authentication checking

but in general it looks like it is starting OK; it just does not label the pods as it should - "spilo-role" is not assigned at all.
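A side note on the bootstrapping warning above: Spilo parses its cluster labels from an environment variable as JSON and falls back to the default when parsing fails. A sketch of a valid value (assuming the variable is KUBERNETES_LABELS; the name is my assumption, not confirmed in this thread):

    # Assumed variable name; the value must be a JSON object, not plain text
    KUBERNETES_LABELS: '{"application": "spilo"}'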

dejwsz commented 4 years ago

I can see KUBERNETES_ROLE_LABEL='spilo-role' is set in the cluster PODs' environments, but the role never gets assigned to the master or replica PODs. Some kind of bug?
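A quick way to watch for this (a sketch using the standard label-column flag; names match the cluster used in this thread):

    # The SPILO-ROLE column stays empty until Patroni assigns a role
    oc get pods -n olm -l cluster-name=test-minimal-cluster -L spilo-role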

dejwsz commented 4 years ago

Is there any way to enforce DEBUG level for Spilo out there? I tried to set "debug_logging: true" in the operator config, but it didn't help. With debug mode in Spilo I could check whether the "Changing the pod's role to" message appears, because then the spilo-role label should be set as well - but that does not happen.

dejwsz commented 4 years ago

OK, I see - it is enough to add DEBUG=true to the config map pointed to by "pod_environment_configmap".
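For example, something like this should work (a sketch; pod-env-cfg is the config map from step 2 of the write-up later in this thread):

    # Merge DEBUG=true into the pod environment config map...
    oc patch configmap pod-env-cfg -n olm --type merge -p '{"data":{"DEBUG":"true"}}'
    # ...and recreate the pods so they pick up the new environment
    oc delete pod -n olm -l cluster-name=test-minimal-cluster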

dejwsz commented 4 years ago

I can see this while bootstrapping:

    2020-03-03 15:05:09,044 - bootstrapping - DEBUG - b"Can't load /root/.rnd into RNG\n139974210200000:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:88:Filename=/root/.rnd\nGenerating a RSA private key\n.+++++\n..................................................................................................................................................................................................+++++\nwriting new private key to '/home/postgres/server.key'\n-----\n"

but later there is no "Changing the pod's role to" message. So it never tries to assign the master or replica role at all, and it does not work well even in privileged mode under OpenShift.

dejwsz commented 4 years ago

So one POD shows this all the time:

    2020-03-03 15:07:26,126 INFO: waiting for leader to bootstrap
    2020-03-03 15:07:36,125 INFO: Lock owner: None; I am test-minimal-cluster-1

and the second this:

    2020-03-03 10:19:58,259 INFO: waiting for leader to bootstrap
    2020-03-03 10:20:08,259 INFO: Lock owner: None; I am test-minimal-cluster-0

Any idea how to run it?
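When a cluster is stuck waiting like this, the member state can be inspected from inside a pod (a sketch; patronictl and Patroni's REST API on port 8008 ship with Spilo):

    # Which member, if any, holds the leader lock?
    oc exec -n olm test-minimal-cluster-0 -- patronictl list
    # This member's own view of its state
    oc exec -n olm test-minimal-cluster-0 -- curl -s http://localhost:8008/patroni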

dejwsz commented 4 years ago

Still no luck, but I got a new error:

    Traceback (most recent call last):
      File "/scripts/callback_endpoint.py", line 9, in <module>
        from kubernetes import client as k8s_client, config as k8s_config
    ModuleNotFoundError: No module named 'kubernetes'

And what is interesting: after reinstalling everything and using this combination of operator version and Spilo image - registry.opensource.zalan.do/acid/postgres-operator:v1.3.1 with registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p16 - I now have properly assigned labels (spilo-role=master and spilo-role=replica are in place).
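The ModuleNotFoundError above means the image is missing the Python kubernetes client that /scripts/callback_endpoint.py imports. A quick check against a running pod (a sketch):

    # Succeeds only if the kubernetes Python client is baked into the image
    oc exec -n olm test-minimal-cluster-0 -- python3 -c 'import kubernetes; print(kubernetes.__version__)'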

dejwsz commented 4 years ago

Interesting - I saw the master service was broken: no selectors. So I removed the cluster and later created it once again, and fixed the master service by adding the missing selectors. And surprise: this time the labels were not assigned to the PODs again. So it didn't work because of this.
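To spot a broken master service quickly (a sketch; the output is empty when the selector is missing):

    # Expect application=spilo, cluster-name=test-minimal-cluster, spilo-role=master
    oc get svc test-minimal-cluster -n olm -o jsonpath='{.spec.selector}'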

dejwsz commented 4 years ago

After another try - cleanup and adding the test cluster again - I had all labels in place, the master service fixed, and finally a cluster in Running state:

    NAME                   TEAM   VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
    test-minimal-cluster   TEST   11        2      1Gi                                     8m    Running

So it works for me in privileged mode on OpenShift 3.11, ufff. I will now try to do it in restricted mode.
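The status table above is what the operator reports on the postgresql custom resource; it can be listed like any other resource (a sketch):

    oc get postgresql -n olm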

dejwsz commented 4 years ago

My steps to run a simple Postgres cluster on OpenShift 3.11 in privileged mode. I installed OLM "0.14.1" and postgres-operator version "1.3.0" (the image was later replaced with version "1.3.1").

  1. Create "postgresql-operator-default-configuration" in olm namespace.

    apiVersion: "acid.zalan.do/v1"
    kind: OperatorConfiguration
    metadata:
    name: postgresql-operator-default-configuration
    configuration:
    docker_image: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p16
    max_instances: 3
    min_instances: 1
    resync_period: 30m
    repair_period: 5m
    workers: 4
    users:
    replication_username: standby
    super_username: postgres
    kubernetes:
    cluster_domain: cluster.local
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    cluster_history_entries: "1000"
    enable_init_containers: true
    enable_pod_antiaffinity: true
    enable_pod_disruption_budget: false
    enable_sidecars: true
    enable_shm_volume: true
    inherited_labels:
      - application
      - environment
    pdb_name_format: "postgres-{cluster}-pdb"
    pod_antiaffinity_topology_key: "failure-domain.beta.kubernetes.io/zone"
    pod_management_policy: ordered_ready
    pod_role_label: spilo-role
    pod_terminate_grace_period: 5m
    secret_name_template: "{username}.{cluster}.credentials.{tprkind}.{tprgroup}"
    toleration: {}
    spilo_privileged: true
    watched_namespace: "olm"
    pod_environment_configmap: "pod-env-cfg"
    postgres_pod_resources:
    default_cpu_limit: "2"
    default_cpu_request: "250m"
    default_memory_limit: "2Gi"
    default_memory_request: "250Mi"
    timeouts:
    pod_label_wait_timeout: 10m
    pod_deletion_wait_timeout: 10m
    ready_wait_interval: 5s
    ready_wait_timeout: 30s
    resource_check_interval: 5s
    resource_check_timeout: 10m
    load_balancer:
    enable_master_load_balancer: false
    enable_replica_load_balancer: false
    master_dns_name_format: "{cluster}.{team}.{hostedzone}"
    replica_dns_name_format: "{cluster}-repl.{team}.{hostedzone}"
    aws_or_gcp:
    aws_region: my-region
    logical_backup:
    logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup"
    logical_backup_s3_access_key_id: "my-accees-key"
    logical_backup_s3_bucket: "spilo-backup"
    logical_backup_s3_endpoint: "my-enpoint"
    logical_backup_s3_secret_access_key: "my-secret"
    logical_backup_s3_sse: "AES256"
    logical_backup_schedule: "*/5 * * * *"
    debug:
    debug_logging: true
    enable_database_access: true
    teams_api:
    enable_team_superuser: false
    enable_teams_api: false
    pam_role_name: teamapipostgres
    protected_role_names:
    - admin
    team_admin_role: admin
    team_api_role_configuration:
      log_statement: all
    logging_rest_api:
    api_port: 8008
    cluster_history_entries: 1000
    ring_log_lines: 100
  2. Create a config map "pod-env-cfg" in olm with:

    DEBUG: "true"
    KUBERNETES_USE_CONFIGMAPS: "true"
    PATRONI_KUBERNETES_ROLE_LABEL: spilo-role
  3. Add the privileged scc to the service account:

    oc adm policy add-scc-to-user privileged -n olm -z operator
  4. Edit the role postgres-operator.v1.3.0-XXXX in olm and change the configmaps permissions:

    - apiGroups:
        - ''
      resources:
        - configmaps
      verbs:
        - get
        - list
        - create
        - patch
        - update
        - watch

    I also added 'update' to endpoints and pods.

  5. Edit the postgres-operator CSV and change/add this:

    spec:
      containers:
        - env:
            - name: POSTGRES_OPERATOR_CONFIGURATION_OBJECT
              value: postgresql-operator-default-configuration
          image: 'registry.opensource.zalan.do/acid/postgres-operator:v1.3.1'
  6. Create a new postgres cluster:

    apiVersion: "acid.zalan.do/v1"
    kind: postgresql
    metadata:
      name: test-minimal-cluster
    spec:
      teamId: "TEST"
      volume:
        size: 1Gi
      numberOfInstances: 2
      users:
        # database owner
        appadmin:
          - superuser
          - createdb
        # role for application foo
        appuser: []
      # databases: name->owner
      databases:
        appdb: appuser
      postgresql:
        version: "11"
  7. (Optional step - only if needed!) Right after that, fix the master service by adding selectors like:

    selector:
      application: spilo
      cluster-name: test-minimal-cluster
      spilo-role: master
  8. Wait until the cluster is Running (the spilo-role label must be assigned to both PODs). A quick verification sketch follows below.
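Pulling the checks from the steps above together, a verification sketch (names match the configuration used in this thread; the service account name in the first check is taken from step 3):

    # 1. The service account is listed on the privileged SCC
    oc get scc privileged -o jsonpath='{.users}' | grep -o 'system:serviceaccount:olm:operator'
    # 2. Both cluster pods carry the spilo-role label
    oc get pods -n olm -l cluster-name=test-minimal-cluster -L spilo-role
    # 3. The master service has its selector populated
    oc get svc test-minimal-cluster -n olm -o jsonpath='{.spec.selector}'
    # 4. The cluster status as reported by the operator
    oc get postgresql -n olm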

dejwsz commented 4 years ago

I tried the same with the latest operator version, 1.4.0, and Spilo image registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p2, and it does not work well: labels are not assigned.

dejwsz commented 4 years ago

The operator in version 1.3.1 with Spilo image registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p2 works fine in privileged mode too.

ReSearchITEng commented 4 years ago

@dejwsz -> can you confirm that with operator 1.3.1 the master service gets the "selector" part populated, and it is only 1.4.0 that has this issue?

dejwsz commented 4 years ago

I switched to other things and a different operator (Crunchy); I do not know if I will find time for this soon. If I do, I will give feedback.

ReSearchITEng commented 4 years ago

For running in OpenShift (including non-root mode), the operator image should be at least registry.opensource.zalan.do/acid/postgres-operator:v1.4.0-21-g1249626-dirty, and the operator should be configured with these values:

    kubernetes_use_configmaps: "true"
    docker_image: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p114  # or newer

To find the latest image tags for both the operator and Spilo, query: https://registry.opensource.zalan.do/v2/acid/postgres-operator/tags/list and https://registry.opensource.zalan.do/v2/acid/spilo-cdp-12/tags/list
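For example (a sketch; the Docker Registry v2 API returns the tag list as JSON):

    curl -s https://registry.opensource.zalan.do/v2/acid/postgres-operator/tags/list
    curl -s https://registry.opensource.zalan.do/v2/acid/spilo-cdp-12/tags/list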

FxKu commented 4 years ago

Thanks @ReSearchITEng for providing info about rootless Spilo and the new operator options. Closing it now.