DeamonMV opened this issue 4 years ago (Open)
Is this timeout of 0s minutes expired the only message you see? Did you configure the 0s? So an immediate timeout? :smiley:
Looking at the code makes me wonder if the move is even tried once, especially since you don't see the inner error message in the logs.
And btw, use the v1.3.1 image, please.
@FxKu thank you for the quick response :)
Is this timeout of 0s minutes expired the only message you see?
Yes, only this. As a first step I remove the label, wait 5 seconds, and cordon the node; after that I see this message and that's all.
Did you configure the 0s? So an immediate timeout? 😃
I'm using the "default" configuration, copy-pasted from GitHub. The timeouts section looks like this:
timeouts:
  pod_label_wait_timeout: 10m
  pod_deletion_wait_timeout: 10m
  ready_wait_interval: 4s
  ready_wait_timeout: 30s
  resource_check_interval: 3s
  resource_check_timeout: 10m
Updated the operator to 1.3.1 - this didn't help.
Hm, strange. The timeout responsible here is master_pod_move_timeout, and if you don't define it, it's 20m by default. Maybe you can set a value then. I have to check if I can reproduce it being 0s (or maybe unset).
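For example, a rough sketch of setting it explicitly on the CRD-based configuration used here (assuming the key is read from the kubernetes section, as in the operator's default configuration manifest; 10m is just an example value):

# merge-patch the OperatorConfiguration object from this setup
# assumption: master_pod_move_timeout lives under configuration.kubernetes
kubectl patch operatorconfigurations.acid.zalan.do postgresql-operator-configuration \
  --type merge \
  -p '{"configuration": {"kubernetes": {"master_pod_move_timeout": "10m"}}}'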
Just in case, this is the full operator configuration I used:
FYI: defining master_pod_move_timeout didn't help.
apiVersion: acid.zalan.do/v1
configuration:
  aws_or_gcp:
    aws_region: eu-central-1
  debug:
    debug_logging: true
    enable_database_access: true
  docker_image: registry.opensource.zalan.do/acid/spilo-11:1.6-p1
  etcd_host: ""
  kubernetes:
    cluster_domain: cluster.local
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    enable_pod_antiaffinity: false
    enable_pod_disruption_budget: true
    node_readiness_label:
      lifecycle-status: ready
    oauth_token_secret_name: postgresql-operator
    pdb_name_format: postgres-{cluster}-pdb
    pod_antiaffinity_topology_key: kubernetes.io/hostname
    pod_management_policy: ordered_ready
    pod_role_label: spilo-role
    pod_service_account_name: postgres
    pod_terminate_grace_period: 5m
    secret_name_template: '{username}.{cluster}.credentials.{tprkind}.{tprgroup}'
    spilo_privileged: false
    watched_namespace: '*'
  load_balancer:
    enable_master_load_balancer: false
    enable_replica_load_balancer: false
    master_dns_name_format: '{cluster}.{team}.{hostedzone}'
    replica_dns_name_format: '{cluster}-repl.{team}.{hostedzone}'
  logging_rest_api:
    api_port: 8008
    cluster_history_entries: 1000
    ring_log_lines: 100
  logical_backup:
    logical_backup_docker_image: registry.opensource.zalan.do/acid/logical-backup
    logical_backup_s3_bucket: ""
    logical_backup_schedule: 30 00 * * *
  master_pod_move_timeout: 5m
  max_instances: -1
  min_instances: -1
  postgres_pod_resources:
    default_cpu_limit: "3"
    default_cpu_request: 100m
    default_memory_limit: 2Gi
    default_memory_request: 100Mi
  repair_period: 5m
  resync_period: 30m
  scalyr:
    scalyr_cpu_limit: "1"
    scalyr_cpu_request: 100m
    scalyr_memory_limit: 1Gi
    scalyr_memory_request: 50Mi
  teams_api:
    enable_team_superuser: false
    enable_teams_api: false
    pam_role_name: zalandos
    protected_role_names:
    - admin
    team_admin_role: admin
    team_api_role_configuration:
      log_statement: all
  timeouts:
    pod_deletion_wait_timeout: 10m
    pod_label_wait_timeout: 10m
    ready_wait_interval: 4s
    ready_wait_timeout: 30s
    resource_check_interval: 3s
    resource_check_timeout: 10m
  users:
    replication_username: standby
    super_username: postgres
  workers: 4
I think this might be solved with #816. Can you run a test with the latest operator image, @DeamonMV?
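A rough sketch of switching the image, assuming the operator runs as a Deployment and a container both named postgres-operator (adjust to your manifests):

# point the operator Deployment at the latest image and let it roll out
kubectl set image deployment/postgres-operator \
  postgres-operator=registry.opensource.zalan.do/acid/postgres-operator:latest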
Ok. Will do.
@FxKu I got the same thing
How I tested (commands sketched below):
- operator image: latest
- removed the lifecycle-status=ready label from the node on which the Postgres master pod was running, and cordoned that node
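Roughly these commands (a sketch of the steps above; test-k8s-worker-1 is the node that hosted the master, per the node listing and warning log further down):

# remove the readiness label from the node hosting the Postgres master
kubectl label node test-k8s-worker-1 lifecycle-status-
# cordon it, which should make the operator move the master away
kubectl cordon test-k8s-worker-1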
Configuration:
Containers:
  postgres-operator:
    Container ID:  docker://d3c14796110533341c53c01cce122622822e3b40cf03eaf286fc2fcd5f0a3caa
    Image:         registry.opensource.zalan.do/acid/postgres-operator:latest
    Image ID:      docker-pullable://registry.opensource.zalan.do/acid/postgres-operator@sha256:deb4d2b716467d5e1b75d8f1724686370f50a7374e4f31f32b33364b1deef139
    Port:          <none>
    Host Port:     <none>
    State:         Running
      Started:     Wed, 19 Feb 2020 16:29:29 +0200
    Ready:         True
# kubectl get operatorconfigurations.acid.zalan.do postgresql-operator-configuration -oyaml
apiVersion: acid.zalan.do/v1
configuration:
  aws_or_gcp:
    aws_region: eu-central-1
  debug:
    debug_logging: true
    enable_database_access: true
  docker_image: registry.opensource.zalan.do/acid/spilo-11:1.6-p1
  etcd_host: ""
  kubernetes:
    cluster_domain: cluster.local
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    enable_pod_antiaffinity: true
    enable_pod_disruption_budget: true
    node_readiness_label:
      lifecycle-status: ready
    oauth_token_secret_name: postgresql-operator
    pdb_name_format: postgres-{cluster}-pdb
    pod_antiaffinity_topology_key: kubernetes.io/hostname
test-k8s-worker-1 Ready,SchedulingDisabled <none> 128d v1.13.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,ceph-role=worker,kubernetes.io/hostname=test-k8s-worker-1
test-k8s-worker-2 Ready <none> 128d v1.13.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,ceph-role=worker,kubernetes.io/hostname=test-k8s-worker-2,lifecycle-status=ready
test-k8s-worker-3 Ready <none> 128d v1.13.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,ceph-role=worker,kubernetes.io/hostname=test-k8s-worker-3,lifecycle-status=ready
time="2020-02-19T14:31:04Z" level=info msg="cluster has been synced" cluster-name=default/grafana-acid-postgres pkg=controller worker=0
time="2020-02-19T14:31:04Z" level=debug msg="cluster already exists" cluster-name=default/grafana-acid-postgres pkg=controller worker=0
time="2020-02-19T14:36:36Z" level=warning msg="failed to move master pods from the node \"test-k8s-worker-1\": timeout of 0s minutes expired" pkg=controller
Hello.
Any updates? I'm wondering because I need to update the Kubernetes cluster, and I use kubespray, and without this feature it's a little bit harder :)
I've added an e2e test for adding the node_readiness_label to test failover. The PR also fixes the logging behavior to show why the move does not work and the timeout is exceeded. One problem we found is that the pod move is only triggered once on a node event. The operator does not retry to move the pod unless you restart it (because the ADD node events on an operator restart would trigger it again).
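So until that lands, a workaround is to restart the operator pod so the node ADD events are replayed and the move is retried; a minimal sketch, assuming the default name=postgres-operator label on the operator's pod:

# delete the operator pod; its ReplicaSet recreates it, and the fresh operator
# sees the cordoned/unready node again and retries the master pod move
kubectl delete pod -l name=postgres-operator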
I still wonder though why the configured timeout doesn't show up in your logs. I would expect to see something like: timeout of 20m0s minutes expired.
I wonder if there is another marshalling issue, but I thought that was fixed with #816.
Hello.
What is the problem
The operator is not able to move the pod after the node_readiness_label was deleted from the k8s node and the node was cordoned.
What is my ENV
Part of the OperatorConfiguration:
Postgres Operator image
registry.opensource.zalan.do/acid/postgres-operator:v1.3.0
This is how the workers are configured; as you can see, one of them is cordoned and does not have the appropriate label:
Pods: