yugabyte / yugabyte-db


[YSQL] Scan slower when using yb_hash_code based predicate #11506

Closed kmuthukk closed 2 years ago

kmuthukk commented 2 years ago

Jira Link: DB-1003

Description

The hash space for hash-partitioned tables is [0..64k). When executing a query with a predicate such as yb_hash_code(pk) < 16384 to query 1/4th of the table, the query run-time is not about 1/4th of that of the equivalent query without the predicate, as one would expect.
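
(For context, 16384 hash values are exactly a quarter of the 64K hash space, so four disjoint predicates of this form cover the whole table. The statements below are illustrative only, reusing the fires table and objectid column from the load script.)

-- The 64K hash space [0..65536) split into four equal, disjoint quarters;
-- on a uniformly distributed table each predicate matches roughly 1/4th of the rows.
SELECT count(*) FROM fires WHERE yb_hash_code(objectid) < 16384;
SELECT count(*) FROM fires WHERE yb_hash_code(objectid) >= 16384 AND yb_hash_code(objectid) < 32768;
SELECT count(*) FROM fires WHERE yb_hash_code(objectid) >= 32768 AND yb_hash_code(objectid) < 49152;
SELECT count(*) FROM fires WHERE yb_hash_code(objectid) >= 49152;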

Load script: https://gist.github.com/kmuthukk/4c7a7e77be023e80be08b838544903b8
Env: RF=1 yb-ctl based setup on my Linux dev server (4-core machine); 2M rows loaded.
YugabyteDB version: 2.11.1

The query on the full data set takes about 15 seconds (see below), but the query using yb_hash_code(objectid) < 16384 takes about 8.5 seconds (rather than the roughly 4.5 seconds one would expect).

yugabyte=#  select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2)))
from fires
yugabyte-# group by year order by year;

 year |       sum
------+-----------------
 2019 | 375616250000.00
 2020 | 250616500000.00
 2021 | 125616750000.00
 2022 |    617000000.00
(4 rows)

Time: 15554.610 ms (00:15.555)
yugabyte=# select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2)))
from fires
where yb_hash_code(objectid) < 16384
yugabyte-# group by year order by year;
 year |      sum
------+----------------
 2019 | 94065761133.00
 2020 | 62882155152.00
 2021 | 31392626780.00
 2022 |   154509140.00
(4 rows)

Time: 8585.179 ms (00:08.585)

The rows in the table are uniformly distributed; i.e., the range we are interested in has about 1/4th of the rows (~500K of 2M rows).

yugabyte=# select count(*) from fires where yb_hash_code(objectid) < 16384;
 count
--------
 500959
(1 row)

yugabyte=# select count(*) from fires;
  count
---------
 2000000
(1 row)

The plan for one shows a Seq Scan vs. an Index Scan for the other, but both should essentially be scans on the primary key (PK), because in YugabyteDB the table is organized on disk in PK order in the DocDB/RocksDB files.

yugabyte=# explain select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2)))
from fires
group by year order by year;

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=135.14..135.64 rows=200 width=52)
   Sort Key: ((date_part('year'::text, (discovery_date_d)::timestamp without time zone))::character varying(4))
   ->  HashAggregate  (cost=122.50..127.50 rows=200 width=52)
         Group Key: (date_part('year'::text, (discovery_date_d)::timestamp without time zone))::character varying(4)
         ->  Seq Scan on fires  (cost=0.00..112.50 rows=1000 width=28)
(5 rows)

Time: 0.898 ms
yugabyte=# explain select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2)))
from fires
where yb_hash_code(objectid) < 16384
group by year order by year;

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=53.17..53.67 rows=200 width=52)
   Sort Key: ((date_part('year'::text, (discovery_date_d)::timestamp without time zone))::character varying(4))
   ->  HashAggregate  (cost=40.53..45.53 rows=200 width=52)
         Group Key: (date_part('year'::text, (discovery_date_d)::timestamp without time zone))::character varying(4)
         ->  Index Scan using fires_pkey on fires  (cost=0.00..37.87 rows=266 width=28)
               Index Cond: (yb_hash_code(objectid) < 16384)
(6 rows)

Time: 0.833 ms
kmuthukk commented 2 years ago

@m-iancu wrote << One subtle difference I remember between a primary-key index scan and a seq scan is regarding which columns we read (I think the pkey scan reads/computes the ybctid but the seq scan does not). So the pkey scan may read some extra columns (which is not ideal and we should fix), but there should be no other difference in theory.

But maybe there's more going on. This is a good area to investigate deeper. >>

kmuthukk commented 2 years ago

And @FranckPachot wrote << With the same query, SeqScan is faster, with many rocksdb_number_db_next (14 per row) but few rocksdb_number_db_seek, which is IMO a good representative of tserver activity. For IndexScan I see 5 rocksdb_number_db_next per row but 2 rocksdb_number_db_seek per row. >>

fritshoogland-yugabyte commented 2 years ago

I tested the case, and simplified the issue to single selects:

  1. explain analyze select discovery_date_d, fire_size from fires; (for a seq scan)
  2. set enable_seqscan=off; explain analyze select discovery_date_d, fire_size from fires where objectid < 2000000; (for an index scan)
  3. set enable_seqscan=off; explain analyze select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536; (for another index scan, using yb_hash_code())

Latency of the SQLs:

  1. Seq scan: 12.5s
  2. Index scan: 12.9s
  3. Index scan (yb_hash_code): 29.2s

When looking at the statistics, it becomes apparent that the time is fully spent in DocDB/tserver: the statistic handler_latency_yb_tserver_TabletServerService_Read on the server accounts for almost all of the SQL time, indicating it was spent in the tserver performing reads. The number of calls to DocDB was equal for all 3: 1956 (although I have no idea what this number means).

My current assumption is that the yb_hash_code() request causes a lot more work on the tserver. The rocksdb statistics show much more work being done in case 3:

1.

rocksdb_db_iter_bytes_read                                                   459849713 bytes     25480673.408 /s
rocksdb_no_table_cache_iterators                                                  1956 iters          108.384 /s
rocksdb_number_db_next                                                        10009732 keys        554647.975 /s
rocksdb_number_db_next_found                                                  10009732 keys        554647.975 /s
rocksdb_number_db_seek                                                         4005852 keys        221967.751 /s
rocksdb_number_db_seek_found                                                   4005852 keys        221967.751 /s

2.

rocksdb_db_iter_bytes_read                                                   459849713 bytes     20019578.276 /s
rocksdb_no_table_cache_iterators                                                  1956 iters           85.155 /s
rocksdb_number_db_next                                                        10009732 keys        435774.140 /s
rocksdb_number_db_next_found                                                  10009732 keys        435774.140 /s
rocksdb_number_db_seek                                                         4005852 keys        174394.950 /s
rocksdb_number_db_seek_found                                                   4005852 keys        174394.950 /s

3.

rocksdb_db_iter_bytes_read                                                   921234211 bytes     23750495.282 /s
rocksdb_no_table_cache_iterators                                                  1956 iters           50.428 /s
rocksdb_number_db_next                                                        28027264 keys        722575.642 /s
rocksdb_number_db_next_found                                                  28027264 keys        722575.642 /s
rocksdb_number_db_seek                                                            1956 keys            50.428 /s
rocksdb_number_db_seek_found                                                      1956 keys            50.428 /s

I took a flamegraph of the tserver while performing the 'quick' index scan and seq scan, and a flamegraph while the yb_hash_code() function was being executed. This would be the time to get people involved who know and understand the yb_hash_code() implementation: there is so much more going on with the function applied that it's hard to know where to start looking.

The first flamegraph shows the fast codepath in the tserver: it looks like a single "mountain". [flamegraph: ix_fast]

This is the flamegraph of the slow yb_hash_code() codepath. It shows multiple mountains, indicating more things are taking place: [flamegraph: ix_hash_slow]

kmuthukk commented 2 years ago

@fritshoogland-yugabyte -- could you try the same experiment on a recent build, but with packed columns enabled? It would be interesting to see if the same results hold.

mtakahar commented 2 years ago

That Query 3 calls only one Seek but ends up with a lot more Nexts might mean it's seeking to the wrong place with pagination. On the other hand, Queries 1 and 2 may be doing something smart using more Seeks; if so, that may be what's missing in Query 3.

If it is indeed related to the pagination state handling, an increased fetch size (ysql_prefetch_limit, default: 1024) should narrow the gap between Queries 1 & 2 and Query 3, both in execution time and in the number of Seeks + Nexts.

fritshoogland-yugabyte commented 2 years ago

As per @kmuthukk's request, these are the results for packed rows:

I upgraded an RF1 node to version 2.13.3, and added the gflag --max_packed_row_columns=8. The results for test 1 (seq scan):

ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_db_iter_bytes_read                                                   591979892 bytes     19155445.638 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_no_table_cache_iterators                                                  5862 iters          189.684 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next                                                        14013670 keys        453458.128 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next_found                                                  14013670 keys        453458.128 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek                                                         4005860 keys        129622.703 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek_found                                                   4005860 keys        129622.703 /s

Test 2 (index scan using the PK):

ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_db_iter_bytes_read                                                   591979892 bytes     40753124.880 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_no_table_cache_iterators                                                  5862 iters          403.552 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next                                                        14013670 keys        964730.139 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next_found                                                  14013670 keys        964730.139 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek                                                         4005860 keys        275771.720 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek_found                                                   4005860 keys        275771.720 /s

test 3 (with yb_hash_code(PK) access via PK index):

ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_db_iter_bytes_read                                                   921238129 bytes     35889131.988 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_no_table_cache_iterators                                                  5862 iters          228.369 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next                                                        28027373 keys       1091876.310 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_next_found                                                  28027373 keys       1091876.310 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek                                                            1978 keys            77.058 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet   rocksdb_number_db_seek_found                                                      1970 keys            76.746 /s

This seems to suggest that packed columns do not change the difference in time and work when yb_hash_code() is used.

mtakahar commented 2 years ago

Found a problem with the yb_hash_code predicate query. It's accessing all the columns in the table even though the query uses only 3 (objectid, fire_size, and discovery_date_d), leading to way more IntentAwareIterator::SeekPastSubKey calls vs. the other two queries.

Query 1 (no predicate - "SeqScan", actually a ForeignScan): I0614 14:51:20.545866 147107840 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010bcd4e60 -> [SystemColumnId(0), ColumnId(6), ColumnId(13)], read time: { read: { physical: 1655232680545219 } local_limit: { physical: 1655232680545219 } global_limit: { physical: 1655232681045268 } in_txn_limit: <invalid> serial_no: 1 }

Query 2 (Index full Scan with out-of-the-upper-bound predicate): I0614 14:51:53.805600 147107840 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010bcd5ae0 -> [SystemColumnId(0), ColumnId(6), ColumnId(13)], read time: { read: { physical: 1655232713805249 } local_limit: { physical: 1655232713805249 } global_limit: { physical: 1655232714305292 } in_txn_limit: <invalid> serial_no: 41 }

Query 3 (Index full Scan with yb_hash_code predicate): I0614 14:52:38.002321 146571264 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010cd34be0 -> [SystemColumnId(0), ColumnId(1), ColumnId(2), ColumnId(3), ColumnId(4), ColumnId(5), ColumnId(6), ColumnId(7), ColumnId(8), ColumnId(9), ColumnId(10), ColumnId(11), ColumnId(12), ColumnId(13)], read time: { read: { physical: 1655232758001519 } local_limit: { physical: 1655232758001519 } global_limit: { physical: 1655232758501584 } in_txn_limit: <invalid> serial_no: 81 }

To see the effect of this, I've tried the fires table with only those 3 columns used in the queries, and got comparable timings (shown below, after a sketch of the reduced table).
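
A minimal sketch of such a reduced table, for illustration only; the column types here are my assumptions, not taken from the original load script:

CREATE TABLE fires (
    objectid         bigint,            -- hash-partitioned primary key
    fire_size        double precision,  -- assumed type
    discovery_date_d date,              -- assumed type
    PRIMARY KEY (objectid HASH)
);

With this 3-column table, the three simplified queries show comparable timings: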

yugabyte=# explain analyze select discovery_date_d, fire_size from fires;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on fires  (cost=0.00..100.00 rows=1000 width=12) (actual time=9.584..13019.465 rows=2000000 loops=1)
 Planning Time: 0.032 ms
 Execution Time: 13186.747 ms
(3 rows)

yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where objectid < 2000000;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using fires_pkey on fires  (cost=0.00..116.50 rows=1000 width=12) (actual time=8.895..13166.596 rows=2000000 loops=1)
   Index Cond: (objectid < 2000000)
 Planning Time: 0.066 ms
 Execution Time: 13328.221 ms
(4 rows)

yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using fires_pkey on fires  (cost=0.00..119.00 rows=1000 width=12) (actual time=10.589..12951.848 rows=2000000 loops=1)
   Index Cond: (yb_hash_code(objectid) < 65536)
 Planning Time: 0.080 ms
 Execution Time: 13119.961 ms
(4 rows)

yugabyte=# 

cc: @tanujnay112 @m-iancu @sushantrmishra

mtakahar commented 2 years ago

ybcInitColumnFilter (yb_scan.c), which determines which columns are to be projected, gives up if it sees a column reference whose attno is InvalidAttrNumber:

    Bitmapset *items = NULL;
    /* Collect bound key attributes */
    AttrNumber *sk_attno = ybScan->target_key_attnums;
    for (AttrNumber *sk_attno_end = sk_attno + ybScan->nkeys; sk_attno != sk_attno_end; ++sk_attno)
        items = bms_add_member(items, *sk_attno - min_attr + 1);

        ... (collect the column references from elsewhere in the query) ...

    /* In case InvalidAttrNumber is set whole row columns are required */
    if (bms_is_member(InvalidAttrNumber - min_attr + 1, items))
        bms_free(items);
    else
        filter->required_attrs = items;

Because the yb_hash_code search condition is translated into a max_hash_code field in the DocDB read request instead of a search key on an actual column, there is no corresponding Postgres column, so InvalidAttrNumber is placed as a placeholder. A possible fix is to skip, while collecting the bound key attributes, the entries in ybScan->target_key_attnums that correspond to the yb_hash_code condition, so that the InvalidAttrNumber placeholder does not force the whole-row projection.

With this change, the simplified query with the yb_hash_code condition performs similarly to the others, and the query in the original test case with yb_hash_code(objectid) < 16384 runs in about 1/4th of the time of the one without.

yugabyte=# select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2))) from fires group by year order by year;
 year |       sum       
------+-----------------
 2019 | 375616250000.00
 2020 | 250616500000.00
 2021 | 125616750000.00
 2022 |    617000000.00
(4 rows)

Time: 20609.718 ms (00:20.610)
yugabyte=# select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2))) from fires where yb_hash_code(objectid) < 16384 group by year order by year;
 year |      sum       
------+----------------
 2019 | 94065761133.00
 2020 | 62882155152.00
 2021 | 31392626780.00
 2022 |   154509140.00
(4 rows)

Time: 4826.153 ms (00:04.826)

Simplified queries:

yugabyte=# explain analyze select discovery_date_d, fire_size from fires;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on fires  (cost=0.00..100.00 rows=1000 width=12) (actual time=14.030..19552.225 rows=2000000 loops=1)
 Planning Time: 0.027 ms
 Execution Time: 19716.102 ms
(3 rows)

yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where objectid < 2000000;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using fires_pkey on fires  (cost=0.00..116.50 rows=1000 width=12) (actual time=15.399..19745.553 rows=2000000 loops=1)
   Index Cond: (objectid < 2000000)
 Planning Time: 0.081 ms
 Execution Time: 19905.992 ms
(4 rows)

yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using fires_pkey on fires  (cost=0.00..119.00 rows=1000 width=12) (actual time=14.457..19672.845 rows=2000000 loops=1)
   Index Cond: (yb_hash_code(objectid) < 65536)
 Planning Time: 0.082 ms
 Execution Time: 19837.977 ms
(4 rows)