[YSQL] Too many pages read with LIMIT and Index Scan on compound index

FranckPachot commented 2 years ago

Jira Link: DB-672

Description

The full case is from this blog post on Index Skip Scan: https://dev.to/franckpachot/index-skip-scan-in-yugabytedb-2ao2

I encountered a case where too many rows are read, by small pages,with query using LIMIT

Test case:

create table demo ( A int, B int, primary key(A ASC, B ASC) );
insert into demo select a,b from generate_series(1,100) a, generate_series(1,100000) b;
explain (costs off, analyze) select * from demo where A>50 and B>5000 limit 10;

The plan looks good, except that the execution time is high (3 seconds) to get 10 rows by Index Scan on the primary key:

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit (actual time=3327.396..3327.892 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=3327.395..3327.889 rows=10 loops=1)
         Index Cond: ((a > 50) AND (b > 5000))
 Planning Time: 0.582 ms
 Execution Time: 3327.929 ms
(5 rows)

The DocDB statistics show many reads

        row_name         | rocksdb_number_db_seek | rocksdb_number_db_next
-------------------------+------------------------+------------------------
 yugabyte demo 10.0.0.62 |                   9503 |                 104524

The yb_debug_log_docdb_requests=true show pagination, reading 9500 pages of 10 rows from (A=50, B=5010) to (A=50, B=100000):

I0404 08:39:53.944916 3937685 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 610383152 schema_version: 8 targets { column_id: 0 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 } is_forward_scan: true is_aggregate: false limit: 10 return_paging_state: true ysql_catalog_version: 156 table_id: "000033e1000030008000000000004206" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 0 } operands { value { int32_value: 50 } } } } operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int32_value: 5000 } } } } } }
I0404 08:39:53.947188 3937685 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 610383152 schema_version: 8 targets { column_id: 0 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 } is_forward_scan: true is_aggregate: false limit: 10 paging_state { next_partition_key: "H\200\000\0002H\200\000\023\222!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\000\023\222!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 10 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } } return_paging_state: true table_id: "000033e1000030008000000000004206" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 0 } operands { value { int32_value: 50 } } } } operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int32_value: 5000 } } } } } } upper_bound { key: "~~~!" is_inclusive: true }, response: status: PGSQL_STATUS_OK rows_data_sidecar: 0 paging_state { }
I0404 08:39:53.948179 3937685 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 610383152 schema_version: 8 targets { column_id: 0 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 } is_forward_scan: true is_aggregate: false limit: 10 paging_state { next_partition_key: "H\200\000\0002H\200\000\023\234!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\000\023\234!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 20 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } } return_paging_state: true table_id: "000033e1000030008000000000004206" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 0 } operands { value { int32_value: 50 } } } } operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int32_value: 5000 } } } } } } upper_bound { key: "~~~!" is_inclusive: true }, response: status: PGSQL_STATUS_OK rows_data_sidecar: 0 paging_state { next_partition_key: "H\200\000\0002H\200\000\023\222!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\000\023\222!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 10 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } }
I040
...

I0404 08:39:58.315771 3937685 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 610383152 schema_version: 8 targets { column_id: 0 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 } is_forward_scan: true is_aggregate: false limit: 10 paging_state { next_partition_key: "H\200\000\0002H\200\001\206\240!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\001\206\240!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 95000 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } } return_paging_state: true table_id: "000033e1000030008000000000004206" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 0 } operands { value { int32_value: 50 } } } } operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int32_value: 5000 } } } } } } upper_bound { key: "~~~!" is_inclusive: true }, response: status: PGSQL_STATUS_OK rows_data_sidecar: 0 paging_state { next_partition_key: "H\200\000\0002H\200\001\206\226!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\001\206\226!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 94990 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } }
I0404 08:39:58.316187 3937685 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 610383152 schema_version: 8 targets { column_id: 0 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 } is_forward_scan: true is_aggregate: false limit: 10 paging_state { next_partition_key: "H\200\000\0003H\200\000\023\221!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0003H\200\000\023\221!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 95010 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } } return_paging_state: true table_id: "000033e1000030008000000000004206" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 0 } operands { value { int32_value: 50 } } } } operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int32_value: 5000 } } } } } } upper_bound { key: "~~~!" is_inclusive: true }, response: status: PGSQL_STATUS_OK rows_data_sidecar: 0 paging_state { next_partition_key: "H\200\000\0002H\200\001\206\240!#\200\001xm\344\245\\\205\200J" next_row_key: "H\200\000\0002H\200\001\206\240!#\200\001xm\344\245\\\205\200J" total_num_rows_read: 95000 read_time { read_ht: 6754556288613982208 DEPRECATED_max_of_read_time_and_local_limit_ht: 6754556290661982208 global_limit_ht: 6754556290661982208 in_txn_limit_ht: 6754556288798355456 local_limit_ht: 6754556290661982208 } }

Note:

"H\200\000\0002H\200\000\023\222!#\200\" is (A=50, B=5010)
"H\200\000\0002H\200\001\206\240!#\200" is (A=50, B=100000):

The first page should have been sufficient for the query. But it seems that all the 95000 rows for A=50 have been read, by pages of 10.

Workaround:

explain (costs off, analyze) 
with v as (
 select * from demo where A>50 and B>5000 limit 10000
) select * from v limit 10;

This is faster because it reads by larger page, and filters afterwards.

mtakahar commented 2 years ago

Because of lack of strict inequality match in docDB (#10738 - @tanujnay112 has a fix but has not landed yet), some (actually many, with this test case) got rejected on the Postgres side.

Those extra filtering happening within the scan node such as IndexScan, etc. are sometimes reported as "Rows Removed by Index Recheck" in the EXPLAIN ANALYZE output, however, that is not always the case.

indextuple_matches_key (the same thing in heaptuple_matches_key, too):

        if (is_null)
            return false;

        bool matches = DatumGetBool(FunctionCall2Coll(&key[i].sk_func,
                                                      key[i].sk_collation,
                                                      res_datum,
                                                      key[i].sk_argument));
        if (!matches)
            return false;
    }

    return true;
}

cc: @m-iancu @sushantrmishra

mtakahar commented 2 years ago

A modified version of the query with "=" stops after receiving the first batch. e.g.: explain (costs off, analyze) select * from demo where A>=50 and B>=5000 limit 10;

I've also tried the original query with @tanujnay112's fix #10738, and it no longer issues extra fetch calls.

sushantrmishra commented 1 year ago

Tested with 2.17.2

yugabyte=# explain (analyze, dist) select * from demo where A>50 and B>5000 limit 10;

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.00 rows=1 width=8) (actual time=5166.045..5166.052 rows=10 loops=1)
   ->  Seq Scan on demo  (cost=0.00..0.00 rows=1 width=8) (actual time=5166.042..5166.047 rows=10 loops=1)
         Remote Filter: ((a > 50) AND (b > 5000))
         Storage Table Read Requests: 1
         Storage Table Execution Time: 5166.040 ms
 Planning Time: 0.083 ms
 Execution Time: 5166.119 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 5166.040 ms
 Peak Memory Usage: 14 kB
(11 rows)

Time: 5166.911 ms (00:05.167)

FranckPachot commented 1 year ago

Yes, fixed

yugabyte / yugabyte-db

[YSQL] Too many pages read with LIMIT and Index Scan on compound index #11965

Description