zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

Fix flaky tests #844

Closed: mneverov closed this PR 9 months ago

mneverov commented 1 year ago

Fix flaky tests

Previously, the tests failed periodically when running in GitHub CI.

  1. The upgrade failed when no backup existed yet (I don't have logs for that).
  2. The upgrade test for a DB containing a table with OIDs creates such a table, expects the check to fail, and then drops the table. CLONE_TARGET_TIME is defined to be 1 minute in the past, so if the tests are quick enough, the DROP TABLE statement for the table with OIDs is not yet in the backup and the table is cloned into upgrade_container (a sketch of the sequence follows the log below). This leads to:
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for system-defined composite types in user tables  ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for tables WITH OIDS                               fatal

Your installation contains tables declared WITH OIDS, which is not supported
anymore. Consider removing the oid column using
    ALTER TABLE ... SET WITHOUT OIDS;
A list of tables with the problem is in the file:
    tables_with_oids.txt

Failure, exiting
Traceback (most recent call last):
  File "/scripts/maybe_pg_upgrade.py", line 197, in <module>
    main()
  File "/scripts/maybe_pg_upgrade.py", line 167, in main
    raise Exception('Failed to upgrade cluster from {0} to {1}'.format(cluster_version, bin_version))
Exception: Failed to upgrade cluster from 11 to 12
2023-01-17 17:22:35,653 ERROR: /scripts/maybe_pg_upgrade.py script failed
2023-01-17 17:22:35,664 INFO: removing initialize key after failed attempt to bootstrap the cluster
2023-01-17 17:22:35,719 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2023-01-17-17-22-35
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 143, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 135, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 105, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 108, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1537, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1399, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1291, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1284, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 120 seconds
Leader is not running after 120 seconds
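
For context, the sequence the OIDs test performs is roughly the following. This is an illustrative sketch only: the container name and the way psql is invoked here are assumptions, not the literal test code.

    # Illustrative sketch of the racy sequence (container name "spilo11" is assumed).
    # WITH OIDS is valid DDL on PostgreSQL 11 and is exactly what the
    # pg_upgrade check in the log above rejects.
    docker exec spilo11 su postgres -c \
        "psql -c 'CREATE TABLE table_with_oids () WITH OIDS;'"

    # ... the pg_upgrade check is expected to fail at this point ...

    docker exec spilo11 su postgres -c \
        "psql -c 'DROP TABLE table_with_oids;'"

    # If the DROP has not reached the archive by the clone target time, the
    # clone still contains the table and the 11 -> 12 upgrade fails as above.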

Waiting for a backup fixed the first error, and dropping the table with OIDs in upgrade_container fixed the second.
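
A minimal sketch of the shape of both fixes, assuming a wal-g based archive and psql access via docker exec; the helper names here are illustrative, the actual change uses the test suite's own helpers.

    # Wait until at least one base backup shows up in the archive before the
    # upgrade path is exercised (polling wal-g like this is an assumption
    # about the tooling; the real helper should also have a timeout).
    wait_backup() {
        local container=$1
        while ! docker exec "$container" su postgres -c "wal-g backup-list" \
                | grep -q base; do
            sleep 5
        done
    }

    # Make sure the table with OIDs cannot survive into upgrade_container.
    drop_table_with_oids() {
        local container=$1
        docker exec "$container" su postgres -c \
            "psql -c 'DROP TABLE IF EXISTS table_with_oids;'"
    }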

hughcapet commented 1 year ago

Thank you for the PR, much appreciated

The first problem is clearly a regression caused by this change. The only thing I would change is putting wait_backup after verify_archive_mode_is_on (as it used to be before the mentioned PR).

As for the second issue, I see the explanation slightly differently. The clone target is actually set to +1 minute. Problems appear when the tests are slow enough that the creation of table_with_oids makes it into the archive we use for the clone, but the drop does not. Here I would rather prevent the 'table with oids' from making it into clones at all. So maybe just delete it right after test_pg_upgrade_to_13_check_failed? And probably also execute wait_backup and wait_zero_lag before the clone; these were also moved to the wrong place in the mentioned PR.
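
Concretely, the suggested ordering amounts to roughly the following. The helper names are the existing ones mentioned above, while their parameters and the clone step itself are assumptions in this sketch.

    # Sketch of the suggested ordering, not the literal test code.
    container=spilo-test-member          # assumed name of the running member

    verify_archive_mode_is_on
    wait_backup                          # restored to its pre-regression position

    test_pg_upgrade_to_13_check_failed
    docker exec "$container" su postgres -c \
        "psql -c 'DROP TABLE table_with_oids;'"   # drop right after the failed check

    wait_backup                          # make sure the drop has reached the archive
    wait_zero_lag                        # and replicas have caught up
    # ... only then start the clone into upgrade_container ...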

mneverov commented 1 year ago

Hi @hughcapet, thank you for the thorough review. I applied your suggestions, ptal.