zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker

Upgrade 12 to 13 fails #554

Closed. Aqualie closed this issue 3 years ago.

Aqualie commented 3 years ago

I'm having issues with this script when trying to upgrade from 12 to 13:

postgres@postgres-0:~/pgdata/pgroot$ patronictl list

+ Cluster: postgres (6836826234917412945) -----+-----+-----------+
| Member     | Host        | Role    | State   |  TL | Lag in MB |
+------------+-------------+---------+---------+-----+-----------+
| postgres-0 | REPLACED | Leader  | running | 302 |           |
| postgres-1 | REPLACED | Replica | running | 302 |         0 |
| postgres-2 | REPLACED | Replica | running | 302 |         0 |
+------------+-------------+---------+---------+-----+-----------+

postgres@postgres-0:~/pgdata/pgroot$ python3 /scripts/inplace_upgrade.py 3

2021-02-12 21:40:53,598 inplace_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-02-12 21:40:54,295 inplace_upgrade WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2021-02-12 21:40:54,298 inplace_upgrade INFO: establishing a new patroni connection to the postgres cluster
2021-02-12 21:40:54,859 inplace_upgrade INFO: Cluster postgres is ready to be upgraded
2021-02-12 21:40:54,862 inplace_upgrade INFO: initdb config: [{'locale': 'en_US.UTF-8'}, {'encoding': 'UTF8'}, 'data-checksums']
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

creating directory /home/postgres/pgdata/pgroot/data_new ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    /usr/lib/postgresql/13/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data_new -l logfile start

2021-02-12 21:41:07,397 inplace_upgrade INFO: Executing pg_upgrade --check
Performing Consistency Checks on Old Live Server
------------------------------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok
Checking for new cluster tablespace directories             ok

*Clusters are compatible*
2021-02-12 21:41:14,792 inplace_upgrade INFO: Dropping objects from the cluster which could be incompatible
2021-02-12 21:41:14,889 inplace_upgrade INFO: Executing "DROP FUNCTION metric_helpers.pg_stat_statements" in the database="postgres"
2021-02-12 21:41:14,892 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_kcache"
2021-02-12 21:41:14,898 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_statements"
2021-02-12 21:41:14,989 inplace_upgrade INFO: Executing "DROP EXTENSION IF EXISTS amcheck_next" in the database="postgres"
2021-02-12 21:41:14,990 inplace_upgrade INFO: Executing "DROP TABLE postgres_log CASCADE" in the database=postgres
2021-02-12 21:41:15,093 inplace_upgrade INFO: Executing "DROP FUNCTION metric_helpers.pg_stat_statements" in the database="template1"
2021-02-12 21:41:15,192 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_kcache"
2021-02-12 21:41:15,195 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_statements"
2021-02-12 21:41:15,197 inplace_upgrade INFO: Executing "DROP EXTENSION IF EXISTS amcheck_next" in the database="template1"
2021-02-12 21:41:15,293 inplace_upgrade INFO: Executing "DROP FUNCTION metric_helpers.pg_stat_statements" in the database="confluence"
2021-02-12 21:41:15,391 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_kcache"
2021-02-12 21:41:15,395 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_statements"
2021-02-12 21:41:15,488 inplace_upgrade INFO: Executing "DROP EXTENSION IF EXISTS amcheck_next" in the database="confluence"
2021-02-12 21:41:15,589 inplace_upgrade INFO: Executing "DROP FUNCTION metric_helpers.pg_stat_statements" in the database="atc"
2021-02-12 21:41:15,636 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_kcache"
2021-02-12 21:41:15,688 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_statements"
2021-02-12 21:41:15,691 inplace_upgrade INFO: Executing "DROP EXTENSION IF EXISTS amcheck_next" in the database="atc"
2021-02-12 21:41:15,787 inplace_upgrade INFO: Executing "DROP FUNCTION metric_helpers.pg_stat_statements" in the database="blacklist"
2021-02-12 21:41:15,849 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_kcache"
2021-02-12 21:41:15,928 inplace_upgrade INFO: Executing "DROP EXTENSION pg_stat_statements"
2021-02-12 21:41:15,952 inplace_upgrade INFO: Executing "DROP EXTENSION IF EXISTS amcheck_next" in the database="blacklist"
2021-02-12 21:41:15,954 inplace_upgrade INFO: Enabling maintenance mode
2021-02-12 21:41:17,138 inplace_upgrade INFO: Maintenance mode enabled
2021-02-12 21:41:17,139 inplace_upgrade INFO: Doing a clean shutdown of the cluster before pg_upgrade
2021-02-12 21:41:19,901 inplace_upgrade INFO: Latest checkpoint location: 187988705320
2021-02-12 21:41:19,902 inplace_upgrade INFO: Starting rsyncd
2021-02-12 21:41:20,008 inplace_upgrade INFO: Waiting for replica nodes to catch up with primary
2021-02-12 21:41:20,015 inplace_upgrade INFO: All replicas are ready
2021-02-12 21:41:20,015 inplace_upgrade INFO: Executing CHECKPOINT on replicas postgres-1,postgres-2
2021-02-12 21:41:20,089 inplace_upgrade INFO: Executing pg_upgrade
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok
Checking for new cluster tablespace directories             ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows in the new cluster                        ok
Deleting files from new pg_xact                             ok
Copying old pg_xact to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster
  postgres
*failure*

Consult the last few lines of "pg_upgrade_dump_13398.log" for
the probable cause of the failure.
Failure, exiting
2021-02-12 21:42:41,797 inplace_upgrade ERROR: Failed to upgrade cluster from 12 to 13
2021-02-12 21:42:41,798 inplace_upgrade INFO: Stopping rsyncd
2021-02-12 21:42:41,799 inplace_upgrade INFO: Disabling maintenance mode
2021-02-12 21:42:42,988 inplace_upgrade INFO: Maintenance mode disabled

pg_upgrade_dump_13398.log

command: "/usr/lib/postgresql/13/bin/pg_dump" --host /home/postgres/pgdata/pgroot/data_upgrade --port 50432 --username postgres --schema-only
--quote-all-identifiers --binary-upgrade --format=custom  --file="pg_upgrade_dump_13398.custom" 'dbname=postgres' >> "pg_upgrade_dump_13398.lo
g" 2>&1

command: "/usr/lib/postgresql/13/bin/pg_restore" --host /home/postgres/pgdata/pgroot/data_upgrade --port 50432 --username postgres --clean --c
reate --exit-on-error --verbose --dbname template1 "pg_upgrade_dump_13398.custom" >> "pg_upgrade_dump_13398.log" 2>&1
pg_restore: connecting to database for restore
pg_restore: dropping DATABASE PROPERTIES postgres
pg_restore: dropping DATABASE postgres
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 3014; 1262 13398 DATABASE postgres postgres
pg_restore: error: could not execute query: ERROR:  database "postgres" is being accessed by other users
DETAIL:  There is 1 other session using the database.
Command was: DROP DATABASE "postgres";

postgres=# select pid as process_id, usename as username, datname as database_name, client_addr as client_address, application_name, backend_start, state, state_change
postgres-# from pg_stat_activity where datname='postgres';

process_id | username | database_name | client_address | application_name  |         backend_start         | state  |         state_change   
------------+----------+---------------+----------------+-------------------+-------------------------------+--------+-------------------------------
       6533 | postgres | postgres      |                | Patroni           | 2021-02-12 21:42:49.389718+00 | idle   | 2021-02-12 21:50:54.498587+00
       6522 | postgres | postgres      |                |                   | 2021-02-12 21:42:48.495981+00 |        |
       6551 | postgres | postgres      |                | pg_cron scheduler | 2021-02-12 21:42:51.535057+00 | idle   | 2021-02-12 21:43:01.188565+00
       6822 | postgres | postgres      |                | psql              | 2021-02-12 21:49:57.388283+00 | active | 2021-02-12 21:51:04.284489+00
(4 rows)
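
For reference, if the goal were just to clear those sessions by hand, a query along these lines would do it (an illustrative sketch only, not something the upgrade script runs; pg_terminate_backend is standard PostgreSQL, but Patroni and pg_cron will simply reconnect, so this alone does not fix the failing pg_restore):

-- terminate every other session connected to the "postgres" database;
-- pg_backend_pid() keeps the psql session issuing the query itself alive
select pid, application_name, pg_terminate_backend(pid) as terminated
  from pg_stat_activity
 where datname = 'postgres'
   and pid <> pg_backend_pid();
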
CyberDem0n commented 3 years ago

pg_restore: dropping DATABASE postgres

Why is it trying to drop the postgres database? Have you dropped and re-created it?

Aqualie commented 3 years ago

I have no idea. I'm just running the upgrade script and not doing anything else. I haven't touched the postgres database since I deployed spilo a long time ago.

CyberDem0n commented 3 years ago

Maybe there is something special about this cluster? Could you please show the content of the postgres.yml file and the patronictl show-config output?

Aqualie commented 3 years ago

I don't remember creating this cluster with any special configs or options; the output is below.

kc exec -t postgres-0 -- patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 1048576
pause: false
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_keep_segments: 8
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

kubectl exec -t postgres-0 -- cat /home/postgres/postgres.yml

bootstrap:
  dcs:
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 100
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        track_functions: all
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
  users:
    admin:
      options:
      - createrole
      - createdb
      password: REPLACED
kubernetes:
  bypass_api_service: true
  labels:
    app: postgres
  pod_ip: REPLACED
  ports:
  - name: postgresql
    port: 5432
  role_label: role
  scope_label: spilo-cluster
  use_endpoints: true
namespace: database
postgresql:
  authentication:
    replication:
      password: REPLACED
      username: repluser
    superuser:
      password: REPLACED
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/12/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: REPLACED:5432
  create_replica_method:
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5432'
  name: postgres-0
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,hstore,intarray,ltree,pgcrypto,pgq,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,hypopg,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 188MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_cert_file: /run/certs/server.crt
    ssl_key_file: /run/certs/server.key
  pg_hba:
  - local   all             all                                   trust
  - hostssl all             +zalandos    127.0.0.1/32       pam
  - host    all             all                127.0.0.1/32       md5
  - hostssl all             +zalandos    ::1/128            pam
  - host    all             all                ::1/128            md5
  - hostssl replication     repluser all                md5
  - hostnossl all           all                all                reject
  - hostssl all             +zalandos    all                pam
  - hostssl all             all                all                md5
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
restapi:
  connect_address: REPLACED:8008
  listen: :8008
scope: postgres
CyberDem0n commented 3 years ago

Yes, interesting. I suspect that bg_mon could be the culprit. https://github.com/zalando/spilo/pull/553 contains a possible fix.
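
To double-check which session is actually holding the postgres database open, the backend_type column of pg_stat_activity is useful; an illustrative query (not something the upgrade script or #553 does) would be:

-- list every other backend connected to "postgres" together with its backend_type;
-- if the suspicion above is right, bg_mon's session should show up here as a
-- background worker rather than a regular 'client backend'
select pid, backend_type, application_name, state
  from pg_stat_activity
 where datname = 'postgres'
   and pid <> pg_backend_pid();
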

Aqualie commented 3 years ago

Got it, let me switch to this tree and give it a try. I'll update with my results soon.

Aqualie commented 3 years ago

The upgrade completed successfully with no issues, and everything is running 👌