zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/

AWS EKS 1.21 Bound Service Account Token Volume fails postgres-operator and pods run into read-only mode #1904

Open kost2191 opened 2 years ago

kost2191 commented 2 years ago

After upgrading to EKS 1.21 on AWS we ran into the issue of outdated service account tokens (https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens). The Postgres operator is set to use the postgres-pod service account. 90 days after the EKS cluster upgrade, pods that were 90 days old started hitting this error in the postgres pods:

2022-05-25 00:31:10,670 ERROR: Unexpected error from Kubernetes API
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 481, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 1012, in touch_member
    ret = self._api.patch_namespaced_pod(self._name, self._namespace, body)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 466, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 402, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 371, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 201, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (401)
Reason: Unauthorized

and this one:

2022-05-25 02:33:10,501 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:10,501 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:11,507 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:11,508 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:12.222 39 LOG {ticks: 0, maint: 0, retry: 0}
2022-05-25 02:33:12,513 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:12,514 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:13,524 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:13,525 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:14,532 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:14,532 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:15,547 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:15,547 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:16,564 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:16,565 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:17,572 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:17,572 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:18,380 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 701, in _load_cluster
    self._wait_caches(stop_time)
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 693, in _wait_caches
    raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2022-05-25 02:33:18,380 ERROR: Error communicating with DCS
2022-05-25 02:33:18,381 INFO: DCS is not accessible
2022-05-25 02:33:18,382 WARNING: Loop time exceeded, rescheduling immediately.
2022-05-25 02:33:18,580 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:18,581 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:19,591 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:19,591 ERROR: ObjectCache.run ApiException()

Is there any option to set a refresh time for the tokens? We worked around it by deleting the pods one by one, but that is not an option in the long run.
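
For reference, a rough sketch of that workaround (assuming the operator's default pod label application=spilo and the default namespace; adjust both for your setup). Deleting the pods one at a time lets the StatefulSet recreate each one with a freshly projected token:

for pod in $(kubectl get pods -l application=spilo -n default \
    -o jsonpath='{.items[*].metadata.name}'); do
  # the StatefulSet recreates the pod under the same name with a fresh token
  kubectl delete pod "$pod" -n default
  kubectl wait --for=condition=Ready "pod/$pod" -n default --timeout=300s
done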

kost2191 commented 2 years ago

Further investigation: we found this commit in zalando/patroni: https://github.com/zalando/patroni/commit/aa0cd480604069519ebd9b52b0d629e33287341c. It seems to refresh the token as needed, but the commit is only on master without any release, so the Spilo image is not using it either. I'll raise it in the Patroni issues as well.
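
For anyone curious, the idea behind that commit is to stop caching the token once at startup and instead re-read the projected token file whenever it changes. A minimal sketch of the concept (not Patroni's actual code; the TokenReloader class is a hypothetical name):

import os

TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'

class TokenReloader:
    """Re-reads the projected service account token when the file changes.

    The kubelet rotates the bound token in the projected volume, so
    checking the file's mtime before each API call picks up the fresh
    token instead of sending the stale cached one (which returns 401).
    """

    def __init__(self, path=TOKEN_PATH):
        self._path = path
        self._mtime = 0.0
        self._token = None

    def get(self):
        mtime = os.stat(self._path).st_mtime
        if self._token is None or mtime > self._mtime:
            with open(self._path) as f:
                self._token = f.read().strip()
            self._mtime = mtime
        return self._token

# usage, e.g. when building request headers:
#   headers['Authorization'] = 'Bearer ' + reloader.get()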

onelapahead commented 2 years ago

This begs the question of why Patroni isn't using the official Python client for Kubernetes, as that would have handled the token refresh automatically from version 12.0.0 onwards (the latest version is 24.2.0), but I'll reserve further thoughts/comments on that for threads in that repo.
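
For comparison, with the official client a long-running process only needs the in-cluster loader, which from 12.0.0 onwards refreshes the projected token behind the scenes. A minimal sketch, assuming the pod's service account may list pods in its namespace:

from kubernetes import client, config

config.load_incluster_config()  # re-reads the rotated token internally
v1 = client.CoreV1Api()
print(len(v1.list_namespaced_pod('default').items))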

That aside, it looks like the fix has now been released in Patroni 2.1.4: https://github.com/zalando/patroni/blob/master/docs/releases.rst#version-214

Spilo 2.1-p6 is then the release that uses it: https://github.com/zalando/spilo/releases/tag/2.1-p6

So presumably either upgrading to https://github.com/zalando/postgres-operator/releases/tag/v1.8.2, where 2.1-p6 is the default image, or using .spec.dockerImage to override it should work: https://github.com/zalando/postgres-operator/blob/3bfd63cbe624eb303d40f6e511e987f4343bb1d7/pkg/controller/operator_config.go#L42
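
For example, an override in the postgresql manifest might look like this (a sketch; the tag assumes the Spilo 14 image published with the 2.1-p6 release, so verify it against the release page before applying):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  # overrides the operator-wide docker_image for this cluster only;
  # tag assumed from the Spilo 2.1-p6 release, double-check it
  dockerImage: registry.opensource.zalan.do/acid/spilo-14:2.1-p6
  teamId: "acid"
  numberOfInstances: 2
  postgresql:
    version: "14"
  volume:
    size: 1Gi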

We will take the approach of upgrading the chart and confirming that the latest Spilo/Patroni is automatically applied.
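
Roughly (assuming the operator was installed from the official chart repository under the release name postgres-operator; both the repo URL and the chart-to-operator version mapping should be double-checked):

helm repo add postgres-operator-charts \
  https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm repo update
helm upgrade postgres-operator postgres-operator-charts/postgres-operator \
  --version 1.8.2  # chart version assumed to match operator v1.8.2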

DotNetPart commented 1 year ago

Hi, any update on the issue?

kost2191 commented 1 year ago

We just built a new Spilo/Patroni image and used it. I think this problem is already solved in newer versions of Patroni, so just update your version.