Postgres scaffolding's method of wiping the database is slow

MaxGabriel commented 4 years ago

A long time back I added database wiping between tests to the scaffold. This is what I did for Postgres:

-- This function will truncate all of the tables in your database.
-- 'withApp' calls it before each test, creating a clean environment for each
-- spec to run in.
wipeDB :: App -> IO ()
wipeDB app = runDBWithApp app $ do
    tables <- getTables
    sqlBackend <- ask

    let escapedTables = map (connEscapeName sqlBackend . DBName) tables
        query = "TRUNCATE TABLE " ++ intercalate ", " escapedTables
    rawExecute query []

This works great, but as you get more and more tables it becomes pretty slow. At work, we have a test suite of 791 tests. Just marking every test as pending and wiping the database between tests was taking ~250 seconds!

Conversely, DELETEing all tables between every test takes us only 13.6 seconds.

The main downside to DELETE is that it causes issues with foreign key constraints. The workaround I came up with for this was having the test suite mark foreign keys as DEFERRABLE, then deferring checking constraints to the end of the transaction in the transaction that wiped the database.

Doing this is slightly complicated. There's also an edge case where DELETE triggers add values to other tables after you've DELETEd from them. For these reasons, I don't think it would be good to add complex DELETE code to the scaffolding.

I think a potentially good course of action is to just link to this issue in the scaffolding though (or a wiki page or something), warning users that if TRUNCATE ever becomes slow for them, they might consider switching to the DELETE shown below. I've copied this code from our codebase, so the comments are slightly specific to our codebase, but other than comments it can be dropped into any codebase:

import Database.Persist.Sql.Raw.QQ (executeQQ)
import Database.Persist.Sql (rawSql, Single)
import Database.Persist.Sql                 (SqlPersistM, runSqlPersistMPool)

-- | Helper function to run a database query with an App.
runDBWithApp :: App -> SqlPersistM a -> IO a
runDBWithApp app query = runSqlPersistMPool query (appConnPool app)

-- | Modifies the database and sets up functions for future calls to wipe it
setupFunctionsForClearingDatabase :: App -> IO ()
setupFunctionsForClearingDatabase app = do
  runDBWithApp app $ do

    --
    -- TRUNCATE vs DELETE strategy discussion
    --

    -- Previously to wipe the DB we made a call to TRUNCATE table1, table2, ... tableN, but this was quite slow
    -- Just running TRUNCATE between empty tests was 250 seconds, for 791 tests, on June 12, 2020
    -- vs DELETE at 13.6169 seconds
    -- (To test the cost of empty tests, export `it` as `xit` from TestImport to make all tests pending)

    -- Postgres docs suggest that TRUNCATE is faster than DELETE https://www.postgresql.org/docs/12/sql-truncate.html
    -- However, they note that TRUNCATE "reclaims disk space immediately"
    -- [1] Notes that TRUNCATE is effectively "DELETE FROM table" + "VACCUUM (FULL, ANALYZE) table;"
    -- And if you read through [2], you can see other work gets done too
    -- This all makes DELETE faster on mostly empty tables

    -- I didn't read through all of [2], but the Postgres people seemed mixed on wanting to solve this?
    -- People also proposed changing the documentation, not sure if that was rejected or just not done.

    -- [1] https://stackoverflow.com/a/11423886/1176156
    -- [2] Mailing list thread: https://www.postgresql.org/message-id/flat/4FFCCAC4.4030503%40ringerc.id.au

    --
    -- Mechanics of DELETE
    --

    -- An issue with DELETE is that we can't delete from multiple tables in a single command (afaik)
    -- So this would cause us to run into foreign key constraint errors if you deleted in the incorrect order
    -- To workaround this, I modify all constraints to be DEFERRABLE †
    -- Then before the actual DELETE calls, I defer constraints to be checked at the end of the transaction
    -- Another option would be to construct the correct order of DELETEs; not sure how tricky that would be.

    -- † This part doesn't actually change how the foreign keys work, it just gives you the *option* to change how they work in a transaction
    -- So actual test behavior is the same as prod

    -- Another issue with DELETE is that DELETE triggers will fire, and could populate tables that have already been deleted.

    --
    -- Other ideas
    --

    -- 1. Check if a table has data before clearing. I'm not sure this would help, but as a small example on empty tables in the test DB:
    -- EXPLAIN ANALYZE DELETE FROM plaid_category_hierarchy; took 1.9ms
    -- EXPLAIN ANALYZE SELECT COUNT(*) > 0 FROM plaid_historical_balances; took 3.3ms
    -- These figures are so small there's a lot of noise, so I expect the SELECT before DELETE would be a minor improvement at best
    -- We could also try this approach with TRUNCATE, just for completeness' sake

    -- 2. See if query plans are cached/we can cache them.
    -- If you look at a DELETE on an empty table, most of the time comes from planning
    -- Subsequent DELETEs are faster b/c the plan is cached, presumably
    -- It's unclear to me if this cached query plan is persisted between each test
    -- If not, improving caching of this could be a small speedup

{-
mercury-web-backend-test=# EXPLAIN ANALYZE DELETE FROM users;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Delete on users  (cost=0.00..14.10 rows=410 width=6) (actual time=0.004..0.004 rows=0 loops=1)
   ->  Seq Scan on users  (cost=0.00..14.10 rows=410 width=6) (actual time=0.003..0.003 rows=0 loops=1)
 Planning Time: 7.588 ms
 Execution Time: 0.093 ms
(4 rows)

mercury-web-backend-test=# EXPLAIN ANALYZE DELETE FROM users;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Delete on users  (cost=0.00..14.10 rows=410 width=6) (actual time=0.003..0.003 rows=0 loops=1)
   ->  Seq Scan on users  (cost=0.00..14.10 rows=410 width=6) (actual time=0.002..0.002 rows=0 loops=1)
 Planning Time: 0.032 ms
 Execution Time: 0.036 ms
(4 rows)

-}

    -- TODO: any reason to make this a function? Should I just execute it here?
    -- Copied partially from https://stackoverflow.com/a/10950402/1176156
    -- For deferring constraints, see https://stackoverflow.com/a/2681413/1176156
    -- I use the postgres-specific tables (e.g. pg_constraint) rather than information_schema
    -- Because information_schema includes multiple rows when a foreign key constraint is based on 2 columns
    -- (information_schema is meant to be standards compliant so it has to compromise accuracy)
    [executeQQ|
CREATE OR REPLACE FUNCTION f_deferrable_constraints()
  RETURNS void AS
$func$
DECLARE
   _schema text;
   _table text;
   _constraint text;

BEGIN
   FOR _schema, _table, _constraint IN
     SELECT pgns.nspname, pgcl.relname, pgco.conname
     FROM pg_constraint pgco
     JOIN pg_class pgcl ON pgcl.oid = pgco.conrelid
     JOIN pg_namespace pgns on pgns.oid = pgcl.relnamespace
     WHERE pgco.contype = 'f'
     AND pgns.nspname = 'public'
  LOOP
    EXECUTE format('ALTER TABLE %I.%I ALTER CONSTRAINT %I DEFERRABLE;', _schema, _table, _constraint);
  END LOOP;
END
$func$ LANGUAGE plpgsql;
    |]

    (_ :: [Single PersistValue]) <- rawSql "SELECT f_deferrable_constraints()" []

    -- Deferring constraints https://www.postgresql.org/docs/12/sql-set-constraints.html

    -- See "Mechanics of DELETE" above
    [executeQQ|
CREATE OR REPLACE FUNCTION f_delete_tables()
  RETURNS void AS
$func$
DECLARE
   _tbl text;
   _sch text;
BEGIN
   SET CONSTRAINTS ALL DEFERRED;
   FOR _sch, _tbl IN
      SELECT schemaname, tablename
      FROM   pg_tables
      WHERE  schemaname = 'public'
      ORDER BY tablename ASC
   LOOP
      EXECUTE format('DELETE FROM %I.%I', _sch, _tbl);
   END LOOP;
END
$func$ LANGUAGE plpgsql;|]

wipeDB :: App -> IO ()
wipeDB app = do
  runDBWithApp app $ do
    -- The f_delete_tables function is created in test/Main.hs
    (_ :: [Single PersistValue]) <- rawSql "SELECT f_delete_tables();" []
    pure ()

snoyberg commented 4 years ago

No strong opinions, docs along these lines sell reasonable to me

parsonsmatt commented 4 years ago

you may also want to consider rolling the transaction back - might be wonky with nested transactions, but this is by far the fastest thing we have used

MaxGabriel commented 4 years ago

Talked to Matt over Slack, but our tests (and the yesod-scaffold ones) are primarily integration style tests that test Handlers end-to-end, so they can't use the DB rollback strategy very easily I think

yesodweb / yesod-scaffold

Postgres scaffolding's method of wiping the database is slow #201