Closed: @kmuthukk closed this issue 2 years ago.
@m-iancu wrote << I do remember some subtle differences between primary-key index and seq scan is regarding which columns we read (I think the pkey scan reads/computes the ybctid but the seq scan does not). So Pkey scan may read some extra columns (which is not ideal and we should fix) but there should be no other difference in theory.
But maybe there’s more going on. this is a good area to investigate deeper. >>
And @FranckPachot wrote << With same query, SeqScan is faster, with many rocksdb_number_db_next (14 per row) but few rocksdb_number_db_seek which is IMO a good representative of tserver activity. For IndexScan I see 5 rocksdb_number_db_next per row but 2 rocksdb_number_db_seek per row. >>
I tested the case and simplified the issue to single selects:

1. explain analyze select discovery_date_d, fire_size from fires; — for a seq scan
2. set enable_seqscan=off; explain analyze select discovery_date_d, fire_size from fires where objectid < 2000000; — for an index scan
3. set enable_seqscan=off; explain analyze select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536; — for another index scan, using yb_hash_code()

Latency of the SQLs: 1: seq scan: 12.5s; 2: index scan: 12.9s; 3: index scan: 29.2s.
When looking at the statistics, it becomes apparent that the time is almost entirely spent in DocDB/tserver: the handler_latency_yb_tserver_TabletServerService_Read statistic on the server accounts for nearly all of the SQL time, indicating it was spent in the tserver performing reads. The number of calls to DocDB was equal for all 3 queries: 1956 (although I have no idea what this number means).
My current assumption is that the yb_hash_code() request causes a lot more work on the tserver. The RocksDB statistics show much more work being done in case 3:
1.
rocksdb_db_iter_bytes_read 459849713 bytes 25480673.408 /s
rocksdb_no_table_cache_iterators 1956 iters 108.384 /s
rocksdb_number_db_next 10009732 keys 554647.975 /s
rocksdb_number_db_next_found 10009732 keys 554647.975 /s
rocksdb_number_db_seek 4005852 keys 221967.751 /s
rocksdb_number_db_seek_found 4005852 keys 221967.751 /s
2.
rocksdb_db_iter_bytes_read 459849713 bytes 20019578.276 /s
rocksdb_no_table_cache_iterators 1956 iters 85.155 /s
rocksdb_number_db_next 10009732 keys 435774.140 /s
rocksdb_number_db_next_found 10009732 keys 435774.140 /s
rocksdb_number_db_seek 4005852 keys 174394.950 /s
rocksdb_number_db_seek_found 4005852 keys 174394.950 /s
3.
rocksdb_db_iter_bytes_read 921234211 bytes 23750495.282 /s
rocksdb_no_table_cache_iterators 1956 iters 50.428 /s
rocksdb_number_db_next 28027264 keys 722575.642 /s
rocksdb_number_db_next_found 28027264 keys 722575.642 /s
rocksdb_number_db_seek 1956 keys 50.428 /s
rocksdb_number_db_seek_found 1956 keys 50.428 /s
I took a flamegraph of the tserver while performing the 'quick' index scan and the seq scan, and another while the yb_hash_code() path was executing. This would be the time to get people involved who know and understand the yb_hash_code() implementation: there is so much more going on when the function is applied that it's hard to know where to start looking.
The first flamegraph shows the fast codepath in tserver: it looks like a single "mountain".
This is the flamegraph of the slow yb_hash_code() codepath. It shows multiple mountains, indicating more things are taking place:
@fritshoogland-yugabyte -- could you try the same experiment on a recent build, but with packed columns enabled? It would be interesting to see whether the same results hold.
Query 3 issues only one Seek per read request but ends up with a lot more Nexts, which might mean it is seeking to the wrong place when resuming after pagination. On the other hand, queries 1 and 2 may be doing something smart with their extra Seeks; if so, that may be what's missing in query 3.
If it is indeed related to the pagination-state handling, an increased fetch size (ysql_prefetch_limit, default 1024) should narrow the gap between queries 1 & 2 and query 3, both in execution time and in the number of Seeks + Nexts.
As per @kmuthukk's request, these are the results with packed rows. I upgraded an RF1 node to version 2.13.3 and added the gflag --max_packed_row_columns=8. The results for test 1 (seq scan):
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_db_iter_bytes_read 591979892 bytes 19155445.638 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_no_table_cache_iterators 5862 iters 189.684 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next 14013670 keys 453458.128 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next_found 14013670 keys 453458.128 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek 4005860 keys 129622.703 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek_found 4005860 keys 129622.703 /s
test 2: (index scan using PK)
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_db_iter_bytes_read 591979892 bytes 40753124.880 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_no_table_cache_iterators 5862 iters 403.552 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next 14013670 keys 964730.139 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next_found 14013670 keys 964730.139 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek 4005860 keys 275771.720 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek_found 4005860 keys 275771.720 /s
test 3 (with yb_hash_code(PK) access via PK index):
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_db_iter_bytes_read 921238129 bytes 35889131.988 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_no_table_cache_iterators 5862 iters 228.369 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next 28027373 keys 1091876.310 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_next_found 28027373 keys 1091876.310 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek 1978 keys 77.058 /s
ip-172-162-48-128.eu-west-2.compute.internal:9000 tablet rocksdb_number_db_seek_found 1970 keys 76.746 /s
This seems to suggest that packed columns do not change the difference in time and work when yb_hash_code() is used.
Found a problem with the yb_hash_code predicate query: it accesses all the columns in the table even though the query uses only 3 (objectid, fire_size and discovery_date_d), leading to far more IntentAwareIterator::SeekPastSubKey calls than the other two queries.
Query 1 (no predicate; "Seq Scan", which is actually a ForeignScan):
I0614 14:51:20.545866 147107840 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010bcd4e60 -> [SystemColumnId(0), ColumnId(6), ColumnId(13)], read time: { read: { physical: 1655232680545219 } local_limit: { physical: 1655232680545219 } global_limit: { physical: 1655232681045268 } in_txn_limit: <invalid> serial_no: 1 }
Query 2 (full index scan with an out-of-the-upper-bound predicate):
I0614 14:51:53.805600 147107840 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010bcd5ae0 -> [SystemColumnId(0), ColumnId(6), ColumnId(13)], read time: { read: { physical: 1655232713805249 } local_limit: { physical: 1655232713805249 } global_limit: { physical: 1655232714305292 } in_txn_limit: <invalid> serial_no: 41 }
Query 3 (full index scan with the yb_hash_code predicate):
I0614 14:52:38.002321 146571264 doc_reader.cc:173] DocDBTableReader: Projection: 0x000000010cd34be0 -> [SystemColumnId(0), ColumnId(1), ColumnId(2), ColumnId(3), ColumnId(4), ColumnId(5), ColumnId(6), ColumnId(7), ColumnId(8), ColumnId(9), ColumnId(10), ColumnId(11), ColumnId(12), ColumnId(13)], read time: { read: { physical: 1655232758001519 } local_limit: { physical: 1655232758001519 } global_limit: { physical: 1655232758501584 } in_txn_limit: <invalid> serial_no: 81 }
To see the effect of this, I tried a fires table containing only the 3 columns used in the queries, and got comparable timings:
yugabyte=# explain analyze select discovery_date_d, fire_size from fires;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on fires (cost=0.00..100.00 rows=1000 width=12) (actual time=9.584..13019.465 rows=2000000 loops=1)
Planning Time: 0.032 ms
Execution Time: 13186.747 ms
(3 rows)
yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where objectid < 2000000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using fires_pkey on fires (cost=0.00..116.50 rows=1000 width=12) (actual time=8.895..13166.596 rows=2000000 loops=1)
Index Cond: (objectid < 2000000)
Planning Time: 0.066 ms
Execution Time: 13328.221 ms
(4 rows)
yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using fires_pkey on fires (cost=0.00..119.00 rows=1000 width=12) (actual time=10.589..12951.848 rows=2000000 loops=1)
Index Cond: (yb_hash_code(objectid) < 65536)
Planning Time: 0.080 ms
Execution Time: 13119.961 ms
(4 rows)
yugabyte=#
cc: @tanujnay112 @m-iancu @sushantrmishra
ybcInitColumnFilter (yb_scan.c), which determines which columns are to be projected, gives up if it sees a column reference whose attno is InvalidAttrNumber.
Bitmapset *items = NULL;
/* Collect bound key attributes */
AttrNumber *sk_attno = ybScan->target_key_attnums;
for (AttrNumber *sk_attno_end = sk_attno + ybScan->nkeys; sk_attno != sk_attno_end; ++sk_attno)
items = bms_add_member(items, *sk_attno - min_attr + 1);
... (collect the column references from elsewhere in the query) ...
/* In case InvalidAttrNumber is set whole row columns are required */
if (bms_is_member(InvalidAttrNumber - min_attr + 1, items))
bms_free(items);
else
filter->required_attrs = items;
Because the yb_hash_code search condition is translated into a max_hash_code field in the DocDB read request instead of a search key on an actual column, there is no corresponding Postgres column, so InvalidAttrNumber is placed as a placeholder. A possible fix is to skip (rather than collect) the column references in ybScan->target_key_attnums that correspond to the yb_hash_code condition.
With this change, the simplified query with the yb_hash_code condition performs similarly to the others, and the query in the original test case with yb_hash_code(objectid) < 16384 runs in about 1/4th of the time of the one without.
yugabyte=# select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2))) from fires group by year order by year;
year | sum
------+-----------------
2019 | 375616250000.00
2020 | 250616500000.00
2021 | 125616750000.00
2022 | 617000000.00
(4 rows)
Time: 20609.718 ms (00:20.610)
yugabyte=# select extract(YEAR from discovery_date_d)::varchar(4) AS Year, sum((round((fire_size::decimal),2))) from fires where yb_hash_code(objectid) < 16384 group by year order by year;
year | sum
------+----------------
2019 | 94065761133.00
2020 | 62882155152.00
2021 | 31392626780.00
2022 | 154509140.00
(4 rows)
Time: 4826.153 ms (00:04.826)
Simplified queries:
yugabyte=# explain analyze select discovery_date_d, fire_size from fires;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on fires (cost=0.00..100.00 rows=1000 width=12) (actual time=14.030..19552.225 rows=2000000 loops=1)
Planning Time: 0.027 ms
Execution Time: 19716.102 ms
(3 rows)
yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where objectid < 2000000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using fires_pkey on fires (cost=0.00..116.50 rows=1000 width=12) (actual time=15.399..19745.553 rows=2000000 loops=1)
Index Cond: (objectid < 2000000)
Planning Time: 0.081 ms
Execution Time: 19905.992 ms
(4 rows)
yugabyte=# explain analyze /*+ Set(enable_seqscan off) */select discovery_date_d, fire_size from fires where yb_hash_code(objectid) < 65536;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using fires_pkey on fires (cost=0.00..119.00 rows=1000 width=12) (actual time=14.457..19672.845 rows=2000000 loops=1)
Index Cond: (yb_hash_code(objectid) < 65536)
Planning Time: 0.082 ms
Execution Time: 19837.977 ms
(4 rows)
Jira Link: DB-1003
Description
The hash space for hash-partitioned tables is [0..64k). When executing a query with a predicate such as yb_hash_code(pk) < 16384 to query 1/4th of the table, the query run time doesn't come out to about 1/4th of that of the equivalent query without the predicate, as one would expect.

Load script: https://gist.github.com/kmuthukk/4c7a7e77be023e80be08b838544903b8
Env: RF=1 yb-ctl based setup on my Linux dev server (4-core machine); 2m rows loaded.
YugabyteDB version: 2.11.1
The query on the full data set takes about 15s (see below), but the query using yb_hash_code(objectid) < 16384 takes 8.5s, rather than the expected ~4.5s. The rows in the table are uniformly distributed, i.e. the range we are interested in does contain about 1/4th of the rows (500K of 2M).
The plan shows "Seq Scan" for one and "Index Scan" for the other, but both should essentially be scans on the primary key (PK), because in YugabyteDB the table is organized on disk in PK order in DocDB/RocksDB files.