zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Metadata annotation "history" in postgres's endpoint is too long #1967

Open danlenar opened 2 years ago

danlenar commented 2 years ago

2022-07-14 20:15:56,317 ERROR: Unexpected error from Kubernetes API
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 483, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 877, in patch_or_create
    return self._patch_or_create(name, annotations, resource_version, patch, retry, ips)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 868, in _patch_or_create
    ret = retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 468, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 404, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 373, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 203, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': '36d5bda8-8944-497c-9559-f101567e2bcf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '02dcfdea-2f55-4b2a-a6c2-99f41f1ab800', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'fe9a36c3-4d6a-41b7-97b0-3f6346defa3e', 'Date': 'Thu, 14 Jul 2022 20:15:56 GMT', 'Content-Length': '761'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Endpoints \\"postgres-postgres-config\\" is invalid: metadata.annotations: Too long: must have at most 262144 bytes","reason":"Invalid","details":{"name":"postgres-postgres-config","kind":"Endpoints","causes":[{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"}]},"code":422}\n'

k -n <> get ep postgres-postgres-config

apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    config: '{"loop_wait":10,"maximum_lag_on_failover":33554432,"postgresql":{"parameters":{"archive_mode":"on","archive_timeout":"1800s","autovacuum_analyze_scale_factor":0.02,"autovacuum_max_workers":5,"au>
      [%p]: [%l-1] %c %x %d %u %a %h ","log_lock_waits":"on","log_min_duration_statement":500,"log_statement":"ddl","log_temp_files":0,"max_connections":"100","max_replication_slots":10,"max_wal_senders":10,>
      all all trust","host all all 127.0.0.1/32 md5","host all all ::1/128 md5","hostssl
      replication standby all md5","hostssl all +zalandos all pam","hostssl all all
      all md5","hostnossl all all all md5"]}'
    history: '[[1,100663456,"no recovery target specified","2021-02-11T19:54:09+00:00"],[2,117440672,"no
      recovery target specified"],[3,2030061304,"no recovery target specified","2021-02-20T17:17:52+00:00"],[4,8455716864,"no
      recovery target specified","2021-02-28T18:09:33+00:00"],[5,22498246816,"no recovery
      target specified","2021-03-18T04:29:35+00:00"],[6,22515024032,"no recovery target
      specified","2021-03-18T04:31:26+00:00"],[7,22531801248,"no recovery target specified","2021-03-18T04:55:43+00:00"],[8,27095204000,"no
      recovery target specified","2021-03-23T20:47:12+00:00"],[9,27128758432,"no recovery
      target specified","2021-03-23T21:46:39+00:00"],[10,29544677536,"no recovery
      target specified","2021-03-26T21:36:58+00:00"],[11,29561454752,"no recovery
      target specified","2021-03-26T21:38:40+00:00"],[12,29578231968,"no recovery
      target specified","2021-03-26T21:54:36+00:00"],[13,29578231968,"no recovery
      target specified","2021-03-26T21:55:18+00:00"],[14,29611786400,"no recovery
      target specified"],[15,29628563616,"no recovery target specified","2021-03-26T22:17:30+00:00"],[16,29645340832,"no
      recovery target specified","2021-03-26T22:28:14+00:00"],[17,43504541880,"no
      recovery target specified","2021-04-13T03:56:18+00:00"],[18,43509265984,"no
      recovery target specified","2021-04-13T04:28:02+00:00"],[19,43512271984,"no
      recovery target specified","2021-04-13T04:37:40+00:00"],[20,43514796584,"no
      recovery target specified","2021-04-13T04:50:35+00:00"],[21,43520405216,"no
      recovery target specified","2021-04-13T05:30:00+00:00"],[22,43522166616,"no
      recovery target specified","2021-04-13T05:44:34+00:00"],[23,43525341240,"no
      recovery target specified","2021-04-13T06:02:48+00:00"],[24,43528604672,"no
      recovery target specified","2021-04-13T06:09:21+00:00"],[25,43536875680,"no
      recovery target specified","2021-04-13T06:18:51+00:00"],[26,50717524128,"no
      recovery target specified","2021-04-22T03:51:55+00:00"],[27,50734301344,"no
      recovery target specified","2021-04-22T04:05:34+00:00"],[28,50818187424,"no
      recovery target specified","2021-04-22T06:13:53+00:00"],[29,54626615456,"no
      recovery target specified","2021-04-26T23:19:57+00:00"],[30,73316434080,"no
      recovery target specified","2021-05-20T04:01:45+00:00"],[31,73317607192,"no
      recovery target specified","2021-05-20T04:14:58+00:00"],[32,101502156960,"no
      recovery target specified","2021-06-24T03:40:38+00:00"],[33,101518934176,"no
      ...
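
For illustration, the limit quoted in the error can be checked offline against a dump like the one above. The following is a minimal Python sketch, assuming (the thread does not say this) that the 262144-byte cap applies to the combined byte length of all annotation keys and values on the object. The function names and the sample data are hypothetical.

```python
import json

# Limit quoted in the API error: "must have at most 262144 bytes".
ANNOTATION_LIMIT_BYTES = 262144


def annotations_size(annotations):
    """Total UTF-8 byte length of all annotation keys and values.

    Assumption: the limit applies to the sum over all keys and values,
    not to each annotation individually.
    """
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
               for k, v in annotations.items())


def history_entries(annotations):
    """Number of timeline entries stored in the 'history' annotation."""
    return len(json.loads(annotations.get("history", "[]")))


if __name__ == "__main__":
    # Hypothetical sample shaped like the endpoint dump above.
    history = [[i, 16777216 * i, "no recovery target specified",
                "2021-02-11T19:54:09+00:00"] for i in range(1, 34)]
    sample = {"config": '{"loop_wait":10}', "history": json.dumps(history)}
    size = annotations_size(sample)
    print(history_entries(sample), "entries,", size, "bytes,",
          "over limit" if size > ANNOTATION_LIMIT_BYTES else "within limit")
```

Feeding it the real annotations (for example, parsed from the endpoint's JSON output) would show how much of the budget the history value consumes.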

CyberDem0n commented 2 years ago

Oh, it seems that you have a lot of failovers... It is possible to reduce the number of history lines stored in the annotation by using the max_timelines_history parameter.
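
As a rough illustration of what such a cap does to the annotation, the sketch below keeps only the most recent N entries of a history value. It is a Python sketch under assumptions: trim_history is a hypothetical name, the thread does not show how Patroni itself applies max_timelines_history, and treating 0 as "no limit" is an assumption made here.

```python
import json


def trim_history(history_json, max_timelines_history):
    """Return the history JSON with only the newest entries kept.

    A value of 0 is treated as "no limit" (assumption made for this sketch).
    """
    entries = json.loads(history_json)
    if max_timelines_history > 0:
        entries = entries[-max_timelines_history:]
    return json.dumps(entries, separators=(",", ":"))


if __name__ == "__main__":
    # Hypothetical history with 33 timelines, like the dump in the report.
    raw = json.dumps([[i, 100 * i, "no recovery target specified"]
                      for i in range(1, 34)])
    trimmed = trim_history(raw, 10)
    print(len(json.loads(trimmed)), "entries kept,",
          len(raw), "->", len(trimmed), "bytes")
```

With a cap of 10, the stored value stays roughly constant in size no matter how many failovers occur.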

wasap commented 2 years ago

I have the same issue. I tried setting max_timelines_history: 10 via patronictl edit-config, restarted all DB pods, restarted the postgres operator pod, and deleted the acid-prod-api config endpoint. I also tried adding it to the database YAML config:

postgresql:
    parameters:
      max_timelines_history: "10"

but I am still getting this error:

ERROR: Unexpected error from Kubernetes API
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 483, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 877, in patch_or_create
    return self._patch_or_create(name, annotations, resource_version, patch, retry, ips)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 868, in _patch_or_create
    ret = retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 468, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 404, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 373, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 203, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'c8949d3d-9984-421a-ad33-3a62c453fd6c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'ca267a38-8ca9-4f84-8fd7-25684e895f05', 'X-Kubernetes-Pf-Prioritylevel-Uid': '80ccb9a2-95f6-47fb-bb00-dd34f2f81d54', 'Date': 'Mon, 18 Jul 2022 09:55:39 GMT', 'Content-Length': '753'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Endpoints \\"acid-prod-api-config\\" is invalid: metadata.annotations: Too long: must have at most 262144 bytes","reason":"Invalid","details":{"name":"acid-prod-api-config","kind":"Endpoints","causes":[{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"},{"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"}]},"code":422}\n'

It looks like it is cached somewhere and nothing helps. Do you have any idea where I can clear it?

vbortnikov commented 2 years ago

"looks like it is cached somewhere and nothing helps. do you have an idea where can i clean it?"

Hi! I fixed it manually: kubectl exec into the pod and run "patronictl edit-config". "maximum_lag_on_failover" belongs to the Patroni layer (config), not to postgres. The list of parameters available in the manifest is limited (see patroni-parameters).

Good luck!

wasap commented 2 years ago

patronictl edit-config

Thank you very much, it helped.

ADiTuri commented 1 year ago

[SOLVED]

Hi guys, thanks for the help!

We are also running the postgres operator and we had the same exception being thrown.

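We followed the steps you provided:

  • manually shortened the endpoint metadata via kubectl -n tefde-bmi-ci-infra edit ep postgres-hive-metastore-config -o yaml (this made it possible to change the config via patronictl)
  • changed the config via patronictl (initially not possible because we were getting the same metadata-too-long exception), adding max_timelines_history: 10

But we are still getting the same error, metadata.annotations: Too long: must have at most 262144 bytes, even though the file is now quite small (less than 1000 bytes). We tried uninstalling the operator and installing it again, but that did not solve the issue.

We seem to have the same "cached somewhere" problem as @wasap. We read the last comment from @vbortnikov, but we could not understand whether we had to set maximum_lag_on_failover and, if so, to what value.

Any other suggestions?

We were putting the max_timelines_history parameter in the wrong place. It goes in the outer (top-level) part of the config: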

max_timelines_history: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
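
The placement mistake described above can be shown on plain data. This is a Python sketch with a hypothetical helper, hoist_patroni_setting, that works on an already-parsed config dict. It is not part of Patroni or the operator; it only illustrates that max_timelines_history belongs at the top level rather than under postgresql.parameters.

```python
import copy


def hoist_patroni_setting(config, key="max_timelines_history"):
    """Move `key` from postgresql.parameters to the top level, if misplaced.

    Returns a new dict; the input is left untouched.
    """
    fixed = copy.deepcopy(config)
    params = fixed.get("postgresql", {}).get("parameters", {})
    if key in params:
        value = params.pop(key)
        # Keep an existing top-level value if one is already set.
        fixed.setdefault(key, int(value))
    return fixed


if __name__ == "__main__":
    # Hypothetical config with the setting in the wrong place.
    misplaced = {
        "maximum_lag_on_failover": 33554432,
        "postgresql": {"parameters": {"archive_mode": "on",
                                      "max_timelines_history": "10"}},
    }
    print(hoist_patroni_setting(misplaced))
```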

We were wondering where the config is cached, because modifying it manually in k8s was not enough.

Happy coding

wasap commented 1 year ago

@ADiTuri wrote:

We followed the steps you provided:

  • manually shortened the endpoint metadata via kubectl -n tefde-bmi-ci-infra edit ep postgres-hive-metastore-config -o yaml (this made it possible to change the config via patronictl)
  • changed the config via patronictl (initially not possible because we were getting the same metadata-too-long exception), adding max_timelines_history: 10

But we are still getting the same error, metadata.annotations: Too long: must have at most 262144 bytes, even though the file is now quite small (less than 1000 bytes). We tried uninstalling the operator and installing it again, but that did not solve the issue.

Any other suggestions?

Connect to each pod with kubectl exec ..., run patronictl edit-config, and add max_timelines_history: 10 there.

matejkostros commented 1 year ago

We are facing the same issue in our Kubernetes deployment. Could the max_timelines_history parameter be included in the postgres manifest? Currently we are able to set only a limited list of Patroni parameters there. If max_timelines_history could be set to a specific value, or if its default were finite, we would avoid issues with oversized annotations: {"reason":"FieldValueTooLong","message":"Too long: must have at most 262144 bytes","field":"metadata.annotations"}

sj-porter-knime commented 2 weeks ago

We ran into this issue as well. It seems that the Postgres Operator was stuck in a reconcile loop due to some bad node affinity settings, which is likely(?) why it built up a ridiculously large history (thousands of entries in the history annotation).

We had to scale down the Postgres Operator and the Postgres deployment, remove the history annotation from the endpoint, then scale the services back up and apply the max_timelines_history: 10 configuration manually. Thank you @wasap and others for the solution there.
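
One way to express the "remove the history annotation" step is a JSON merge patch, in which setting a key to null removes it. The Python sketch below only builds the patch body and prints a kubectl patch command for review; it does not contact a cluster. The namespace and endpoint names are placeholders, and the thread does not say whether these are the exact commands that were used.

```python
import json
import shlex


def remove_annotation_patch(annotation="history"):
    """JSON merge patch body that deletes one metadata annotation.

    In a JSON merge patch (RFC 7386), a null value removes the key.
    """
    return json.dumps({"metadata": {"annotations": {annotation: None}}})


def kubectl_patch_command(namespace, endpoint, patch):
    """Build (not run) a kubectl command applying the patch to an Endpoints object."""
    args = ["kubectl", "-n", namespace, "patch", "endpoints", endpoint,
            "--type", "merge", "-p", patch]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Placeholder names; substitute the real namespace and config endpoint.
    print(kubectl_patch_command("my-namespace", "my-cluster-config",
                                remove_annotation_patch()))
```

As in the steps above, this would presumably be run while the operator and database pods are scaled down, so that nothing rewrites the annotation in the meantime.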

It would be really nice to have a way to set this permanently, or even just by default - what is this history even used for?