oumkale opened this issue 3 years ago
Having a stale role is absolutely normal in this case. Since you cut the network, Patroni can't update its own labels, but at the same time, I am 100% sure that it demoted postgres to read-only.
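(A quick sketch of how to double-check that, assuming the spilo-role label from the operator config below and the app-0/app-1 pod names used later in this thread: inspect the role labels and ask Postgres itself whether it is read-only once the pod is reachable again.)

# Show which pod currently carries which role label (pod_role_label is spilo-role)
kubectl get pods -l application=spilo -L spilo-role

# Ask Postgres directly inside the spilo container; "t" means it is in
# recovery, i.e. demoted to read-only
kubectl exec app-0 -c postgres -- psql -tAc "SELECT pg_is_in_recovery();"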
There are 2 replicas, a master and a slave. After a network loss of a couple of seconds, the following error is thrown:
psql: could not connect to server: Connection refused Is the server running on host "localhost" and accepting TCP/IP connections on port 5432?
But the exporter service is not exporting metrics for nearly 90 seconds. And what is the sudden spike in the dashboard due to? @CyberDem0n
And sometimes there are no metrics from either server?
I can't add anything more. There is something weird with your config. For example, your postgres-pg-exporter service will grab all spilo pods that belong to the team postgres.
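As a sketch (the service name postgres-pg-exporter is taken from above, the labels from the operator config below), you can check what that service actually selects and which pods end up behind it:

# Print the label selector of the exporter service
kubectl get svc postgres-pg-exporter -o jsonpath='{.spec.selector}'; echo

# List the pod IPs behind that service; if both spilo pods show up here,
# the scrape mixes master and replica
kubectl get endpoints postgres-pg-exporter -o wide

# Compare with the per-pod role labels
kubectl get pods -l application=spilo -L spilo-role,cluster-name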
Most of the time it is exporting, but after multiple network losses this can happen. I still don't get why there is a spike in the metrics after the role is regained. Please take a look at the config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  api_port: "8080"
  aws_region: eu-central-1
  cluster_domain: cluster.local
  cluster_history_entries: "1000"
  cluster_labels: application:spilo
  cluster_name_label: cluster-name
  connection_pooler_image: "registry.opensource.zalan.do/acid/pgbouncer:master-18"
  db_hosted_zone: db.example.com
  debug_logging: "true"
  enable_sidecars: "true"
  docker_image: registry.opensource.zalan.do/acid/spilo-13:2.1-p1
  enable_ebs_gp3_migration: "false"
  enable_master_load_balancer: "false"
  enable_pgversion_env_var: "true"
  enable_replica_load_balancer: "false"
  enable_spilo_wal_path_compat: "true"
  enable_team_member_deprecation: "false"
  enable_teams_api: "false"
  external_traffic_policy: "Cluster"
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup:v1.6.3"
  logical_backup_job_prefix: "logical-backup-"
  logical_backup_provider: "s3"
  logical_backup_s3_bucket: "my-bucket-url"
  logical_backup_s3_sse: "AES256"
  logical_backup_schedule: "30 00 * * *"
  major_version_upgrade_mode: "manual"
  master_dns_name_format: "{cluster}.{team}.{hostedzone}"
  pdb_name_format: "postgres-{cluster}-pdb"
  pod_deletion_wait_timeout: 10m
  pod_label_wait_timeout: 10m
  pod_management_policy: "ordered_ready"
  pod_role_label: spilo-role
  pod_service_account_name: "postgres-pod"
  pod_terminate_grace_period: 10m
  ready_wait_interval: 3s
  ready_wait_timeout: 30s
  repair_period: 5m
  replica_dns_name_format: "{cluster}-repl.{team}.{hostedzone}"
  replication_username: standby
  resource_check_interval: 3s
  resource_check_timeout: 10m
  resync_period: 30m
  ring_log_lines: "100"
  role_deletion_suffix: "_deleted"
  secret_name_template: "{username}.{cluster}.credentials"
  spilo_allow_privilege_escalation: "true"
  spilo_privileged: "false"
  storage_resize_mode: "pvc"
  super_username: postgres
  watched_namespace: "*"
  workers: "8"
@CyberDem0n
Also, I found many weird things while killing the postgres container multiple times or injecting network loss: the roles will not change. The master stays master and the slave stays slave, even when there is a disturbance.
The failure detection isn't instant. Please learn how Patroni works: https://github.com/patroni-training/2019 (there is also plenty of material available on the internet on this topic)
Yes, sure, I got it, but still: app-0 is master and app-1 is slave, and then the network to the master is lost (during this the metrics are non-zero). After some time app-1 is master and app-0 is slave. Now I inject network loss against the master again, which is app-1, and now it is possible that no metrics come at all. Why? It seems the application is not resilient sometimes; the downfall is a network loss against the new master.
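(A sketch of how I watch the roles while the chaos experiment runs, assuming the app-0/app-1 pod names and that patronictl is available inside the spilo image:)

# Watch the spilo-role label flip during the experiment
kubectl get pods -l application=spilo -L spilo-role -w

# Ask Patroni itself which member is currently the leader
kubectl exec app-0 -c postgres -- patronictl list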
I can't explain your graph because I don't understand what you are doing. Also, I don't really want to spend my time trying to understand it. Most of your questions will be answered when you learn how Patroni works (and K8s Services).
Hey @CyberDem0n, I know K8s Services and how Patroni works. I also know your time is important, and thank you for replying, but I found a couple of bugs that I wanted to share with you.
I have been doing chaos testing using LitmusChaos for our demo. We wanted to see that in a resilient case like HA, if the master fails, there is a standby that takes care of it, like what Patroni does. And we wanted to see in the monitoring whether that is actually happening or not, but during this I sometimes found different behaviour.
But during this the application actually behaves weakly, so if you are available for some time we could have a Google Meet or something where I will explain the bugs to you. @CyberDem0n
We wanted to see that in a resilient case like HA, if the master fails, there is a standby that takes care of it, like what Patroni does
It is not "like", there is Patroni inside.
If you think that you found a bug, you should:
The first two points are very important; they are about making sure that Patroni isn't doing what it is supposed to do.
Yes, you are right @CyberDem0n, I went through it.
My point is that it should behave the same way every time I inject network-loss chaos for a couple of seconds. Sometimes it works fine and sometimes it doesn't. If I kill the container multiple times, then it should update itself every time.
There are two services, master and replica. I have a load generator that generates load every second, without failure, during the network loss. But why is it not reflected in the metrics?
Please understand what exactly I wanted to share. I went through the Zalando docs and I am pretty hands-on with the concepts.
it should behave the same way every time I inject network-loss chaos
Welcome to the distributed world. It is not supposed to always behave the same way.
I inject network-loss chaos for a couple of seconds.
First of all, it is not clear what "chaos" is; second, if you literally mean two seconds, nothing should ever happen. Patroni is configured to be resilient to network problems shorter than 10s by default.
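(For reference, a sketch of how to read those timeouts from inside the pod, assuming the pod/container names used elsewhere in this thread; the Patroni defaults are ttl: 30, loop_wait: 10, retry_timeout: 10.)

# Print Patroni's dynamic configuration; ttl, loop_wait and retry_timeout
# control how long a network problem can last before the leader is demoted
# or a failover happens
kubectl exec app-0 -c postgres -- patronictl show-config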
If I kill the container multiple times, then it should update itself every time.
I don't get it: who should update what?
We are running in circles, please provide a reproducible test case.
When the postgres container is killed for a couple of seconds, the pods get stuck as master/slave and do not promote themselves. While having an HA master and slave, we expected the metrics to come through in a consistent way, so that one can conclude that queries are working properly in the HA cluster. But looking at the Grafana dashboard graph it is confusing. Both images are attached.
After injecting network loss against the master, both become master (leader and follower). After the network loss completes, the original role is regained.
And after the network loss completes there is a sudden spike in the metrics. The exporter is running as a sidecar.
Prometheus query:
Also found that sometimes neither the master nor the slave exports metrics.
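When that happens, a sketch of how I check the sidecar directly (the container name exporter and port 9187, the postgres_exporter default, are assumptions; adjust them to your sidecar definition):

# Talk to the exporter sidecar on one pod directly, bypassing the service
kubectl port-forward pod/app-0 9187:9187 &
sleep 2
curl -s http://localhost:9187/metrics | head

# Check the sidecar's own logs around the chaos window for scrape errors
kubectl logs app-0 -c exporter --since=15m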
Manifest: