GROUP BY index_column LIMIT n uses table scan instead of index scan

kmuthukk commented 3 years ago

Jira Link: DB-1328 Issue reported by user on www.yugabyte.com/slack.

When GROUP BY LIMIT is used:

if the indexed column is ordered (e.g. ASC) depending on the value specified for LIMIT a table scan or index scan is used.
if the indexed columnn is HASH organized, then we always seem to be using a table scan

Test Case:

https://gist.github.com/kmuthukk/3d6b325b80c5f5474d7d391f365d51bb

Output:

$ python3 ~/notes/ysql_group_by_index_col.py
Running tests with ASC index for column v
Created table: tab
Created index: tab_v_idx
Inserted 200 rows
Testing with limit: 100
('Limit  (cost=0.00..60.50 rows=100 width=40)',)
('  ->  GroupAggregate  (cost=0.00..121.00 rows=200 width=40)',)
('        Group Key: v',)
('        ->  Index Only Scan using tab_v_idx on tab  (cost=0.00..114.00 rows=1000 width=32)',)
=====================
Testing with limit: 200
('Limit  (cost=105.00..107.00 rows=200 width=40)',)
('  ->  HashAggregate  (cost=105.00..107.00 rows=200 width=40)',)
('        Group Key: v',)
('        ->  Foreign Scan on tab  (cost=0.00..100.00 rows=1000 width=32)',)
=====================
===========================================
Running tests with HASH index for column v
Created table: tab
Created index: tab_v_idx
Inserted 200 rows
Testing with limit: 100
('Limit  (cost=105.00..106.00 rows=100 width=40)',)
('  ->  HashAggregate  (cost=105.00..107.00 rows=200 width=40)',)
('        Group Key: v',)
('        ->  Foreign Scan on tab  (cost=0.00..100.00 rows=1000 width=32)',)
=====================
Testing with limit: 200
('Limit  (cost=105.00..107.00 rows=200 width=40)',)
('  ->  HashAggregate  (cost=105.00..107.00 rows=200 width=40)',)
('        Group Key: v',)
('        ->  Foreign Scan on tab  (cost=0.00..100.00 rows=1000 width=32)',)
=====================
===========================================

kmuthukk commented 3 years ago

Setting enable_hashagg to false is one potential workaround for now to force index-scan in this situation.

postgres=# set enable_hashagg=true;
SET
postgres=# explain select v, count(*) from tab group by v limit 190;
                               QUERY PLAN
-------------------------------------------------------------------------
 Limit  (cost=105.00..106.90 rows=190 width=40)
   ->  HashAggregate  (cost=105.00..107.00 rows=200 width=40)
         Group Key: v
         ->  Foreign Scan on tab  (cost=0.00..100.00 rows=1000 width=32)
(4 rows)

postgres=# set enable_hashagg=false;
SET
postgres=# explain select v, count(*) from tab group by v limit 190;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=0.00..114.95 rows=190 width=40)
   ->  GroupAggregate  (cost=0.00..121.00 rows=200 width=40)
         Group Key: v
         ->  Index Only Scan using tab_v_idx on tab  (cost=0.00..114.00 rows=1000 width=32)
(4 rows)

sarthak-tomar commented 3 years ago

set enable_hashagg=false; @kmuthukk it's working now , can we set this parameter globally at YugabyteDB Cluster level.

m-iancu commented 3 years ago

@coder2k16 Yes, you can use the tserver gflag ysql_pg_conf (when starting the tserver/cluster) and set that to enable_hashagg=false. Then that will be the default for all sessions.

Note: Example usage of that flag (for configuring auth in that instance): https://docs.yugabyte.com/latest/secure/authentication/password-authentication/#enable-scram-sha-256-authentication

snehalbhavsar916 commented 3 years ago

Thanks it worked for GROUP BY and DISTINCT Queries for a single table. #5866 But it is still not working for the queries having JOIN and having group by both. For this case we still observer same issue, that the query is using index only for certain number of records.

performance_test=# set enable_hashagg=false; SET Time: 1.613 ms performance_test=# explain select x.name , count(*) from test.user x, test.user y where x.usersid_pk=y.usersid_pk group by x.name;

                                   QUERY PLAN

GroupAggregate (cost=697.19..736.69 rows=200 width=66) Group Key: x.name -> Sort (cost=697.19..709.69 rows=5000 width=58) Sort Key: x.name -> Hash Join (cost=112.50..390.00 rows=5000 width=58) Hash Cond: (x.usersid_pk = y.usersid_pk) -> Foreign Scan on "user" x (cost=0.00..100.00 rows=1000 width=62) -> Hash (cost=100.00..100.00 rows=1000 width=4) -> Foreign Scan on "user" y (cost=0.00..100.00 rows=1000 width=4) (9 rows) Time: 12.943 ms performance_test=#

performance_test=# explain select x.name , count(*) from test.user x, test.user y where x.usersid_pk=y.usersid_pk group by x.name limit 100;

                                          QUERY PLAN

Limit (cost=0.00..677.50 rows=100 width=66) -> GroupAggregate (cost=0.00..1355.00 rows=200 width=66) Group Key: x.name -> Nested Loop (cost=0.00..1328.00 rows=5000 width=58) -> Index Scan using idx_name on "user" x (cost=0.00..124.00 rows=1000 width=62) -> Index Only Scan using idx_usersid_pk on "user" y (cost=0.00..1.10 rows=10 width=4) Index Cond: (usersid_pk = x.usersid_pk) (7 rows) Time: 20.519 ms performance_test=#

Please help us to understand what does this variable do and how does it work? Why do we need to set it OFF explicitly in ysql to enable index usage. Do every time we need to set it OFF while executing such queries? Can this issue will get fixed in future releases of yugabyte?

yugabyte / yugabyte-db

GROUP BY index_column LIMIT n uses table scan instead of index scan #5834