yugabyte / yugabyte-db


Foreign key creation slow in pgbench #2066

Open nspiegelberg opened 5 years ago

nspiegelberg commented 5 years ago

Jira Link: DB-2558

I'm trying to do some preliminary pgbench timing on my laptop. All the steps run decently fast except for creating the foreign keys (10+ min vs. ~10 sec). By default, pgbench runs this step after populating the DB tables. Below are the runtimes for generating 100k rows (the minimum amount); I had to Ctrl+C out of the foreign key step after 13 minutes:

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|…6] 13:30 $ time ./build/latest/postgres/bin/pgbench -i -Id
dropping old tables... done.

real    0m9.873s
user    0m0.002s
sys     0m0.004s

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|…6] 13:30 $ time ./build/latest/postgres/bin/pgbench -i -Itp
creating tables (with primary keys)... done.

real    0m2.153s
user    0m0.002s
sys     0m0.004s

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|…6] 13:31 $ time ./build/latest/postgres/bin/pgbench -i -Ig
generating data... 100000 of 100000 tuples (100%) done (elapsed 1.78 s, remaining 0.00 s)
done.

real    0m11.028s
user    0m0.019s
sys     0m0.005s

13:31 $ time ./build/latest/postgres/bin/pgbench -i -If
creating foreign keys... ^C

real    13m5.264s
user    0m0.002s
sys     0m0.004s

If you want to run the last part in ysqlsh for debugging, these are the statements that pgbench runs to create the foreign keys:

alter table pgbench_tellers add constraint pgbench_tellers_bid_fkey foreign key (bid) references pgbench_branches;
alter table pgbench_accounts add constraint pgbench_accounts_bid_fkey foreign key (bid) references pgbench_branches;
alter table pgbench_history add constraint pgbench_history_bid_fkey foreign key (bid) references pgbench_branches;
alter table pgbench_history add constraint pgbench_history_tid_fkey foreign key (tid) references pgbench_tellers;
alter table pgbench_history add constraint pgbench_history_aid_fkey foreign key (aid) references pgbench_accounts;
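To time one of these statements in isolation (a sketch; assumes the pgbench tables have already been created and populated):

\timing on
-- run one statement at a time under \timing to see which ones stall
alter table pgbench_tellers add constraint pgbench_tellers_bid_fkey foreign key (bid) references pgbench_branches;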

nspiegelberg commented 5 years ago

Note that I ran YugaByte in serializable mode:

$ ysqlsh
postgres=# alter database postgres set default_transaction_isolation=serializable;
ALTER DATABASE
postgres=# select current_setting('transaction_isolation');
current_setting
-----------------
serializable
(1 row)
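To rule the isolation level in or out as a factor, the setting can be reset to the cluster default and the foreign key step rerun (a sketch):

alter database postgres reset default_transaction_isolation;
-- reconnect, then rerun: time ./build/latest/postgres/bin/pgbench -i -If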
nspiegelberg commented 5 years ago

Similar stalls occur in the generation pathway if the foreign keys are added to the tables before populating them with data:

20:13 $ time ./build/latest/postgres/bin/pgbench -i -Id
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
done.

real    0m0.098s
user    0m0.002s
sys     0m0.005s

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|✚ 1…6] 20:13 $ time ./build/latest/postgres/bin/pgbench -i -Itp
creating tables (with primary keys)... done.

real    0m1.487s
user    0m0.002s
sys     0m0.004s

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|✚ 1…6] 20:13 $ time ./build/latest/postgres/bin/pgbench -i -If
creating foreign keys... done.

real    0m2.432s
user    0m0.002s
sys     0m0.004s

Nicolass-MacBook-Pro.local:~/code/yugabyte-db [pgbench_alter L|✚ 1…6] 20:13 $ time ./build/latest/postgres/bin/pgbench -i -Ig
generating data... 100000 of 100000 tuples (100%) done (elapsed 1.79 s, remaining 0.00 s) ^C

real    14m58.785s
user    0m0.024s
sys     0m0.006s
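As a possible workaround sketch for the load-after-FK case, assuming YSQL honors PostgreSQL's session_replication_role GUC (it bypasses the per-row FK trigger checks, so it is only safe when the generated data is known to be consistent):

set session_replication_role = replica;  -- skip per-row FK checks during the bulk load
-- ... run the data generation here ...
set session_replication_role = default;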

FranckPachot commented 9 months ago

Can we improve the validation of foreign keys? I don't know what execution plan the validation currently uses, but there are probably better ones, especially with Batched Nested Loops.

Example:

drop table if exists t1, t2;
create table t1 ( id int primary key );
create table t2 ( id int );
insert into t1 select generate_series(1,100000);
insert into t2 select generate_series(1,100000);

\timing on
alter table t2 add constraint fk_t2_t2 foreign key (id) references t1 ;
ALTER TABLE
Time: 1137.127 ms (00:01.137)
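For reference, PostgreSQL's initial validation of a new foreign key (RI_Initial_Check) issues an internal anti-join query roughly shaped like the one below, so the plan it gets can be inspected directly (a sketch; the exact internal SQL may differ in YSQL):

explain (analyze, costs off)
select fk.id
from only t2 fk
left outer join only t1 pk on pk.id = fk.id
where pk.id is null and fk.id is not null;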

With Batched Nested Loop:

yugabyte=# explain (analyze, dist off, debug off, costs off)
yugabyte-# /*+ Set(yb_bnl_batch_size 1024) */
yugabyte-# select id from t2 where not exists ( select id from t1 where t1.id=t2.id) limit 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual time=840.053..840.053 rows=0 loops=1)
   ->  YB Batched Nested Loop Anti Join (actual time=840.052..840.052 rows=0 loops=1)
         Join Filter: (t1.id = t2.id)
         ->  Seq Scan on t2 (actual time=0.906..114.941 rows=100000 loops=1)
         ->  Index Scan using t1_pkey on t1 (actual time=2.719..6.156 rows=1020 loops=98)
               Index Cond: (id = ANY (ARRAY[t2.id, $1, $2, ..., $1023]))
 Planning Time: 0.521 ms
 Execution Time: 840.226 ms
 Peak Memory Usage: 1909 kB
(9 rows)

Time: 853.683 ms

Hash Join:

yugabyte=# explain (analyze, dist off, debug off, costs off)
yugabyte-# /*+ HashJoin(t1 t2) */
yugabyte-# select id from t2 where not exists ( select id from t1 where t1.id=t2.id) limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit (actual time=297.468..297.469 rows=0 loops=1)
   ->  Hash Anti Join (actual time=297.466..297.467 rows=0 loops=1)
         Hash Cond: (t2.id = t1.id)
         ->  Seq Scan on t2 (actual time=1.060..110.174 rows=100000 loops=1)
         ->  Hash (actual time=134.021..134.021 rows=100000 loops=1)
               Buckets: 131072 (originally 1024)  Batches: 2 (originally 1)  Memory Usage: 3073kB
               ->  Seq Scan on t1 (actual time=0.696..103.604 rows=100000 loops=1)
 Planning Time: 0.208 ms
 Execution Time: 297.558 ms
 Peak Memory Usage: 5514 kB
(10 rows)

Time: 310.526 ms

Merge Join:

yugabyte=# explain (analyze, dist off, debug off, costs off)
yugabyte-# /*+ MergeJoin(t1 t2) */
yugabyte-# select id from t2 where not exists ( select id from t1 where t1.id=t2.id) limit 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Limit (actual time=286.712..286.712 rows=0 loops=1)
   ->  Merge Anti Join (actual time=286.711..286.711 rows=0 loops=1)
         Merge Cond: (t2.id = t1.id)
         ->  Sort (actual time=111.885..121.411 rows=100000 loops=1)
               Sort Key: t2.id
               Sort Method: external merge  Disk: 1376kB
               ->  Seq Scan on t2 (actual time=1.186..65.138 rows=100000 loops=1)
         ->  Sort (actual time=106.148..115.350 rows=100000 loops=1)
               Sort Key: t1.id
               Sort Method: external merge  Disk: 1376kB
               ->  Seq Scan on t1 (actual time=1.414..60.396 rows=100000 loops=1)
 Planning Time: 0.141 ms
 Execution Time: 287.251 ms
 Peak Memory Usage: 9857 kB
(14 rows)

Time: 299.934 ms

(and it can be much faster still if there is an ASC/DESC index on the foreign key column)
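For instance (a sketch; t2_id_asc is a hypothetical name), a range index on the referencing column lets the merge join read that side pre-sorted instead of doing an external sort:

create index t2_id_asc on t2 (id ASC);  -- range-sharded index on the FK column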

With EXCEPT:

yugabyte=# explain (analyze, dist off, debug off, costs off)
yugabyte-# select id from t2 except select id from t1 limit 1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit (actual time=412.224..412.224 rows=0 loops=1)
   ->  HashSetOp Except (actual time=412.222..412.222 rows=0 loops=1)
         ->  Append (actual time=1.103..318.111 rows=200000 loops=1)
               ->  Subquery Scan on "*SELECT* 1" (actual time=1.102..137.769 rows=100000 loops=1)
                     ->  Seq Scan on t2 (actual time=1.100..114.571 rows=100000 loops=1)
               ->  Subquery Scan on "*SELECT* 2" (actual time=0.709..138.445 rows=100000 loops=1)
                     ->  Seq Scan on t1 (actual time=0.707..115.250 rows=100000 loops=1)
 Planning Time: 0.068 ms
 Execution Time: 413.669 ms
 Peak Memory Usage: 11392 kB
(10 rows)

Time: 426.112 ms

All of these are faster than the validation done by ADD CONSTRAINT. If the choice cannot be cost-based, my preference goes to the Merge Join, as there's a good chance that an index exists on the foreign key column.
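One way to approximate this today (a sketch, assuming YSQL inherits PostgreSQL's NOT VALID option for foreign key constraints; fk_t2_t1 is a hypothetical constraint name) is to add the constraint unvalidated and run the faster anti-join check by hand:

alter table t2 add constraint fk_t2_t1 foreign key (id) references t1 not valid;
-- manual equivalent of the validation; should return zero rows on consistent data
select id from t2 where not exists (select id from t1 where t1.id = t2.id) limit 1;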