yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Very low performance in graph traversal with recursive CTEs #18588

Closed St4NNi closed 11 months ago

St4NNi commented 1 year ago

Jira Link: DB-7520

Description

Hi everyone,

In our performance evaluation of YugabyteDB against plain Postgres and other competitors, we saw very low performance with recursive CTEs on larger tables. The test is derived from your graph-traversal examples. For testing purposes this was done on a single-node Docker instance, but a quick test on a multi-node setup showed similar results.

Setup

The minimal setup consists of a directed acyclic graph represented by an edges table:

CREATE TABLE edges (
    id UUID PRIMARY KEY NOT NULL,
    origin UUID,
    target UUID
);
CREATE INDEX ot_idx ON edges (origin, target);
CREATE INDEX to_idx ON edges (target, origin);

The graph has a maximum depth of 4 layers (3 edges) and consists of approximately 2.2 million rows, with a mix of paths of one, two, and three edges.
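
For anyone trying to reproduce this, here is a rough sketch of how a layered graph of this general shape can be generated (this is not our actual data set; the md5(...)::uuid cast is just a cheap way to derive stable UUIDs from integers):

-- 10,000 children pointing at 1,000 roots, and 100,000 grandchildren
-- pointing at those children (paths of one and two edges):
INSERT INTO edges (id, origin, target)            -- child -> root
SELECT md5('e1-' || i)::uuid, md5('c-' || i)::uuid, md5('r-' || i / 10)::uuid
FROM generate_series(0, 9999) AS i;

INSERT INTO edges (id, origin, target)            -- grandchild -> child
SELECT md5('e2-' || i)::uuid, md5('g-' || i)::uuid, md5('c-' || i / 10)::uuid
FROM generate_series(0, 99999) AS i;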

Problem

The problem arises when these paths are queried via recursive CTEs like so:

WITH RECURSIVE paths AS (
    SELECT origin, target
    FROM edges WHERE target = $1
  UNION ALL 
    SELECT e.origin, e.target
    FROM paths, edges e WHERE e.target = paths.origin
) SELECT * FROM paths;

I expect this to use the indexes to speed up the query, and indeed, if I load this exact setup into a vanilla Postgres instance, that is the case:

EXPLAIN ANALYZE on Postgres:

CTE Scan on paths  (cost=454.22..456.24 rows=101 width=32) (actual time=0.035..0.071 rows=3 loops=1)
   CTE paths
     ->  Recursive Union  (cost=0.43..454.22 rows=101 width=32) (actual time=0.034..0.070 rows=3 loops=1)
           ->  Index Only Scan using to_idx on edges  (cost=0.43..4.45 rows=1 width=32) (actual time=0.033..0.033 rows=1 loops=1)
                 Index Cond: (target = '<SOME-PID>'::uuid)
                 Heap Fetches: 0
           ->  Nested Loop  (cost=0.43..44.77 rows=10 width=32) (actual time=0.011..0.011 rows=1 loops=3)
                 ->  WorkTable Scan on paths paths_1  (cost=0.00..0.20 rows=10 width=16) (actual time=0.000..0.000 rows=1 loops=3)
                 ->  Index Only Scan using to_idx on edges e  (cost=0.43..4.45 rows=1 width=32) (actual time=0.009..0.010 rows=1 loops=3)
                       Index Cond: (e.target = paths_1.origin)
                       Heap Fetches: 0
 Planning Time: 0.243 ms
 Execution Time: 0.110 ms

The same query on YugabyteDB, however, takes forever because it uses repeated sequential scans instead of the index:

EXPLAIN ANALYZE on YugabyteDB:

 CTE Scan on paths  (cost=4195826.03..4636656.69 rows=22041533 width=32) (actual time=0.494..15469.336 rows=3 loops=1)
   CTE paths
     ->  Recursive Union  (cost=0.00..4195826.03 rows=22041533 width=32) (actual time=0.493..15469.331 rows=3 loops=1)
           ->  Index Only Scan using to_idx on edges  (cost=0.00..24555.24 rows=218233 width=32) (actual time=0.491..0.492 rows=1 loops=1)
                 Index Cond: (target = '<SOME-PID>'::uuid)
                 Heap Fetches: 0
           ->  Hash Join  (cost=260431.46..373044.01 rows=2182330 width=32) (actual time=5150.801..5156.240 rows=1 loops=3)
                 Hash Cond: (paths_1.origin = e.target)
                 ->  WorkTable Scan on paths paths_1  (cost=0.00..43646.60 rows=2182330 width=16) (actual time=0.001..0.001 rows=1 loops=3)
                 ->  Hash  (cost=218233.30..218233.30 rows=2182333 width=32) (actual time=5138.620..5138.620 rows=2182333 loops=3)
                       Buckets: 65536  Batches: 64  Memory Usage: 2635kB
                       ->  Seq Scan on edges e  (cost=0.00..218233.30 rows=2182333 width=32) (actual time=2.339..4740.180 rows=2182333 loops=3)
 Planning Time: 0.134 ms
 Execution Time: 15469.442 ms
 Peak Memory Usage: 5239 kB

Are we doing something wrong here? Is there a way to speed this up, or is this a known issue with YugabyteDB? A quick search revealed nothing useful.

Thanks in advance!


ddorian commented 1 year ago

CREATE INDEX ot_idx ON edges (origin, target); CREATE INDEX to_idx ON edges (origin, target);

These indexes are the same. Can you do \d+ edges on both dbs?

What version of PostgreSQL are you using? We're based on 11.x and might be missing that optimization for CTEs.

Also, can you explain what workload you expect in production? You're doing full-table scans in this case, which is an OLAP scenario and not a very good use case for YugabyteDB: https://docs.yugabyte.com/preview/faq/general/#when-is-yugabytedb-not-a-good-fit.

We specialize more in OLTP.

St4NNi commented 1 year ago

Sorry, there was a copy-paste error; the two indexes are each other's reverse:

\d+ edge postgres:

postgres=# \d+ edges;
                                               Table "public.edges"
    Column     |          Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
---------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id            | uuid                   |           | not null |         | plain    |             |              | 
 origin        | uuid                   |           |          |         | plain    |             |              | 
 target        | uuid                   |           |          |         | plain    |             |              | 
Indexes:
    "edges_pkey" PRIMARY KEY, btree (id)
    "to_idx" btree (target, origin)
    "ot_idx" btree (origin, target)
Access method: heap

\d+ edge yugabyte:

test=# \d+ edges;
                                        Table "public.edges"
    Column     |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
---------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id            | uuid                   |           | not null |         | plain    |              | 
 origin        | uuid                   |           |          |         | plain    |              | 
 target        | uuid                   |           |          |         | plain    |              | 
Indexes:
    "edges_pkey" PRIMARY KEY, lsm (id HASH)
    "to_idx" lsm (target HASH, origin ASC)
    "ot_idx" lsm (origin HASH, target ASC)

And yes, I was using Postgres 15.3 for comparison.

Regarding our use case: we are developing a distributed scientific data storage system with strict consistency requirements across multiple locations. The edges table describes the internal relations between data objects and their hierarchy (think folders in file-system terms). The data itself is stored externally at a multitude of different providers (S3, filesystems, etc.), but we need a database to track all binary data objects and all associated metadata. The expected workload is to query the full hierarchy for each object on creation, because parents might need to be notified and/or actions need to be triggered. This is not really latency-critical, since the data upload almost always takes the majority of the time, but ensuring consistency between all locations is crucial, and the difference between a few milliseconds and several seconds would be a real deal-breaker for us.

FranckPachot commented 1 year ago

@St4NNi Can you test the performance by forcing IndexScan and Batched Nested Loop:

explain analyze /*+ indexscan(e) set(yb_bnl_batch_size 1024) */
WITH RECURSIVE paths AS (
    SELECT origin, target
    FROM edges WHERE target = $1
  UNION ALL
    SELECT e.origin, e.target
    FROM paths, edges e WHERE e.target = paths.origin
) SELECT * FROM paths;
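
If that helps, roughly the same plan can probably also be forced at the session level instead of inlining the hint (a sketch, not verified on your data):

SET yb_bnl_batch_size = 1024;   -- enable batched nested loop joins
SET enable_seqscan = off;       -- discourage the repeated Seq Scan on edges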

And a question: is (origin, target) unique? If so, it might be better to define the table as:

CREATE TABLE edges (
    id UUID UNIQUE NOT NULL,
    origin UUID,
    target UUID,
    primary key(target, origin)
);

And for this query you don't need any other index. You can still create ot_idx if you need it. In YugabyteDB, access by the primary key is faster because it doesn't need additional hops to the table.
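
A quick way to check whether (origin, target) is unique (just a sketch of the check):

SELECT origin, target, count(*)
FROM edges
GROUP BY origin, target
HAVING count(*) > 1
LIMIT 10;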

St4NNi commented 1 year ago

Hey, thank you very much for your speedy response!

Forcing IndexScan and Batched Nested Loops seems to have done the trick:

 CTE Scan on paths  (cost=60091411873.62..60091852704.28 rows=22041533 width=32) (actual time=0.700..2.409 rows=3 loops=1)
   CTE paths
     ->  Recursive Union  (cost=0.00..60091411873.62 rows=22041533 width=32) (actual time=0.699..2.407 rows=3 loops=1)
           ->  Index Only Scan using to_idx on edges  (cost=0.00..24555.24 rows=218233 width=32) (actual time=0.698..0.699 rows=1 loops=1)
                 Index Cond: (target = '<ID>'::uuid)
                 Heap Fetches: 0
           ->  YB Batched Nested Loop Join  (cost=0.00..6009094648.77 rows=2182330 width=32) (actual time=0.564..0.565 rows=1 loops=3)
                 Join Filter: (paths_1.origin = e.target)
                 ->  WorkTable Scan on paths paths_1  (cost=0.00..43646.60 rows=2182330 width=16) (actual time=0.000..0.000 rows=1 loops=3)
                 ->  Index Scan using to_idx on edges e  (cost=0.00..26187.99 rows=218233 width=32) (actual time=0.526..0.526 rows=1 loops=3)
                       Index Cond: (target = ANY (ARRAY[paths_1.origin, $2, $3, ..., $1024]))
 Planning Time: 0.567 ms
 Execution Time: 2.546 ms
 Peak Memory Usage: 568 kB
(14 rows)

That is much more in line with our expectations. Unfortunately, target and origin are not always unique in our production setup (because we have different edge types), but a few ms is more than enough for us. Thanks!

P.S.: Taking a closer look at the cost estimates, something seems a little odd: cost=60091411873.62..60091852704.28 are absurdly large numbers, which may explain why this plan was not chosen by the optimizer.

ddorian commented 1 year ago

(because we do have different edge types)

Why not include the edge_type in the primary key then? Assuming it's also used in queries.
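
Something along these lines, as a sketch (assuming an edge_type column; the TEXT type is just a placeholder):

CREATE TABLE edges (
    id UUID UNIQUE NOT NULL,
    origin UUID,
    target UUID,
    edge_type TEXT,   -- placeholder type for illustration
    PRIMARY KEY (target, origin, edge_type)
);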

ddorian commented 11 months ago

Closing as fixed after no response from user.