zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

One instance of PostgreSQL crashed - psql #889

Closed: wolacinio closed this issue 10 months ago

wolacinio commented 4 years ago

I have a two-instance PostgreSQL cluster. I created a service in Kubernetes to connect to the PostgreSQL master instance. I ran this command on the master node machine:

```
/usr/pgsql-11/bin/psql -d sample -h 10.132.54.131 -p 30010 -U postgres -f dump.sql
```

and I received:

```
psql:dump.sql:91600: WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
psql:dump.sql:91600: SSL SYSCALL error: EOF detected
psql:dump.sql:91600: connection to server was lost
```

`kubectl logs -f postgres-0` (master logs):

```
2020-03-29 17:58:43,602 INFO: no action. i am the leader with the lock
2020-03-29 17:58:48,581 INFO: Lock owner: czk-postgres-0; I am czk-postgres-0
2020-03-29 17:58:48,605 INFO: no action. i am the leader with the lock
2020-03-29 17:58:51.183 36 LOG {ticks: 0, maint: 0, retry: 0}
2020-03-29 17:58:53,582 INFO: Lock owner: czk-postgres-0; I am czk-postgres-0
2020-03-29 17:58:57,599 ERROR: failed to update leader lock
2020-03-29 17:58:58,008 INFO: demoted self because failed to update leader lock in DCS
2020-03-29 17:58:58,376 INFO: closed patroni connection to the postgresql cluster
2020-03-29 17:58:58,577 INFO: Lock owner: czk-postgres-0; I am czk-postgres-0
2020-03-29 17:58:58,602 INFO: updated leader lock during starting after demotion
2020-03-29 17:58:58,753 INFO: postmaster pid=2515
2020-03-29 17:58:58 UTC [2515]: [1-1] 5e80e1e2.9d3 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2020-03-29 17:58:58 UTC [2515]: [2-1] 5e80e1e2.9d3 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2020-03-29 17:58:58 UTC [2515]: [3-1] 5e80e1e2.9d3 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-03-29 17:58:58 UTC [2515]: [4-1] 5e80e1e2.9d3 0 LOG: listening on IPv6 address "::", port 5432
/var/run/postgresql:5432 - no response
2020-03-29 17:58:58 UTC [2515]: [5-1] 5e80e1e2.9d3 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-29 17:58:59 UTC [2515]: [6-1] 5e80e1e2.9d3 0 LOG: redirecting log output to logging collector process
2020-03-29 17:58:59 UTC [2515]: [7-1] 5e80e1e2.9d3 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-03-29 17:59:03,577 INFO: Lock owner: czk-postgres-0; I am czk-postgres-0
2020-03-29 17:59:03,584 INFO: updated leader lock during starting after demotion
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-03-29 17:59:08,577 INFO: Lock owner: czk-postgres-0; I am czk-postgres-0
2020-03-29 17:59:08,584 INFO: updated leader lock during starting after demotion
```

Sometimes the dump was interrupted earlier, sometimes later.

```
root@czk-postgres-1:/home/postgres# patronictl -c postgres.yml list
+--------------+----------------+---------------+--------+---------+----+-----------+
|   Cluster    |     Member     |      Host     |  Role  |  State  | TL | Lag in MB |
+--------------+----------------+---------------+--------+---------+----+-----------+
| czk-postgres | czk-postgres-0 | 10.244.165.80 | Leader | running |  4 |           |
| czk-postgres | czk-postgres-1 |  10.244.95.1  |        | running |    |         2 |
+--------------+----------------+---------------+--------+---------+----+-----------+
```

I added more memory, but it still doesn't work.

gobadiah commented 4 years ago

@wolacinio have you figured out what caused the issue? I'm experiencing a similar problem.

wolacinio commented 4 years ago

@gobadiah It seems to me the main problem was in the Kubernetes API or the Kubernetes proxy. I reinstalled Kubernetes.

FxKu commented 4 years ago

@wolacinio The operator creates a service for you; you don't have to do it yourself. It would be interesting to know which resources you have defined in the postgres manifest. Maybe they are not enough. Have you tried the logical backup feature yet? I would not run a dump on the instance itself.
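For reference, the logical backup feature mentioned here is enabled per cluster in the postgresql manifest. A minimal sketch follows; the schedule is only an example value, and the backup target (S3 bucket, credentials, backup image) is configured on the operator side, not in this manifest:

```yaml
# Sketch: enabling the operator's scheduled logical backup for one cluster.
# The cron schedule below is illustrative; the S3 destination and the
# logical-backup image are set in the operator configuration, not here.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: czk-postgres
spec:
  enableLogicalBackup: true
  logicalBackupSchedule: "30 00 * * *"  # cron format: daily at 00:30
```

The resulting backup runs as a Kubernetes CronJob in a separate pod, so it does not compete for the database pod's memory limit the way an in-pod dump would.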

wolacinio commented 4 years ago

@FxKu I know that. The logical backup just runs a Docker image, executes the backup, and stops; it works the same as running psql locally, so I don't think it matters how you connect to the PostgreSQL service. I ran:

```
/usr/pgsql-11/bin/psql -d sample -h 10.132.54.131 -p 30010 -U postgres -f dump.sql
```

This approach works fine with other clients. postgres-manifest.yml:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: czk-postgres

spec:
  dockerImage: registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p16
  teamId: "czk"
  volume:
    size: 10Gi
  storageClass: gluster-heketi-external
  numberOfInstances: 1
  users: {} # Application/Robot users
  enableMasterLoadBalancer: false
  enableReplicaLoadBalancer: false
  allowedSourceRanges:  # load balancers' source ranges for both master and replica services
  - 127.0.0.1/32
  databases: {}
  podAnnotations: {}

# Expert section

  enableShmVolume: true
# spiloFSGroup: 103
  postgresql:
    version: "11"
    parameters:
      max_connections: "100"
  resources:
    requests:
      cpu: 10m
      memory: 200Mi
    limits:
      cpu: 500m
      memory: 500Mi
  patroni:
    initdb:
      encoding: "UTF8"
      locale: "en_US.UTF-8"
      data-checksums: "true"
    pg_hba:
    - hostssl all all 0.0.0.0/0 trust
    - host    all all 0.0.0.0/0 trust
#   slots:
#     - permanent_physical_1:
#         type: physical
#     - permanent_logical_1:
#         type: logical
#         database: foo
#         plugin: pgoutput
    ttl: 30
    loop_wait: &loop_wait 5
    retry_timeout: 10
    maximum_lag_on_failover: 33554432
```
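One detail worth noting in the logs above: the "failed to update leader lock" line means Patroni could not refresh the leader key in the DCS (here, the Kubernetes API) within `retry_timeout`, so it demoted itself, which matches the reporter's conclusion that the API was the culprit. A hedged sketch of more forgiving timings under that assumption (values are illustrative, not recommendations; Patroni expects roughly `ttl >= loop_wait + 2 * retry_timeout`):

```yaml
# Sketch only: relaxed Patroni timings for a slow or flaky Kubernetes API.
# Illustrative values; keep ttl >= loop_wait + 2 * retry_timeout
# (here 60 >= 10 + 2 * 20). A larger ttl also means slower failover detection.
  patroni:
    ttl: 60
    loop_wait: 10
    retry_timeout: 20
```

The trade-off is that a leader that really has died holds the lock longer before a replica can take over, so these values should only be raised as far as the API-server latency actually requires.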
vitobotta commented 4 years ago

Hi @FxKu, I just noticed these warnings in my app's logs too, for the first time; I'm not sure whether it has happened before. I have set limits of 1 core and 1 GB, and there is very little activity (just me, since I haven't launched my app yet). Could it happen because deploying new workers terminates the previous ones? Or is it something in Postgres itself, unrelated to the app that uses it? Thanks!

Deepanshu6dec commented 2 years ago

Since this issue is still open, I'm adding to it. Even after providing enough memory and CPU, the running pod still fails. It works for about an hour, then goes into a failed state. The last logs are:

```
2022-07-27 16:25:36,747 INFO: Lock owner: None; I am smsc-cluster-0
2022-07-27 16:25:36,748 INFO: starting as a secondary
2022-07-27 16:25:37,170 INFO: postmaster pid=13271
/var/run/postgresql:5432 - no response
2022-07-27 16:25:37 UTC [13271]: [1-1] 62e16701.33d7 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-07-27 16:25:37 UTC [13271]: [2-1] 62e16701.33d7 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2022-07-27 16:25:37 UTC [13271]: [3-1] 62e16701.33d7 0 LOG: redirecting log output to logging collector process
2022-07-27 16:25:37 UTC [13271]: [4-1] 62e16701.33d7 0 HINT: Future log output will appear in directory "../pg_log".
2022-07-27 16:25:38,230 ERROR: postmaster is not running
```

Jeansen commented 2 years ago

Same here. I just created the operator and a simple minimal cluster per the README. The use case I am testing is uploading a 50 MB binary file to a dummy table. That is, I connect to the postgres service (type: LoadBalancer), create a table with one column of type bytea, and then try to upload the file. A few seconds later, I get:

```
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.
```

Deepanshudang commented 2 years ago

Please increase the resource limits and requests in the postgresql manifest file.
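As a sketch, that means raising the `resources` section of the cluster manifest. The numbers below are purely illustrative starting points (not recommendations from this thread); the right values depend on the workload and on Postgres memory settings such as `shared_buffers`:

```yaml
# Illustrative only: more generous resources than the 200Mi/500Mi
# values shown earlier in this thread. When Postgres exceeds the memory
# limit, the kernel OOM-kills a backend and the postmaster restarts,
# which matches the crash symptoms reported here.
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
```

Checking `kubectl describe pod` for an `OOMKilled` termination reason is a quick way to confirm whether the memory limit, rather than Postgres itself, is what is killing the process.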

Jeansen commented 2 years ago

@Deepanshudang Increasing the resources helped, but it took me a while to find the right settings. Is there any documentation on how to choose these values? We also store binary files of varying sizes, and I have no idea whether this could happen again with bigger files. Simply saying "take whatever the node gives you" is not workable.

FxKu commented 10 months ago

Whenever you enter the postgresql and patroni manifest sections to override config parameters, you have to know what you're doing. We do not guarantee that the system will run properly with your custom settings. Maybe it was even wrong to expose these config options after all :smiley:. On the other hand, people are often hesitant to patch Spilo to their own needs.