yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
9k stars 1.07k forks source link

[YSQL][SQLsmith] Stuck query in ReplicationSlotAllocationLock, can't be cancelled #11220

Open def- opened 2 years ago

def- commented 2 years ago

Jira Link: DB-858

Description

After a few hours SQLsmith gets stuck in ReplicationSlotAllocationLock, I can reproduce it with the last run query:

select
  subq_0.c0 as c0,
  subq_0.c2 as c1,
  subq_0.c2 as c2,
  subq_0.c0 as c3
from
  (select
        82 as c0,
        ref_0.privilege_type as c1,
          pg_catalog.bit_or(
            cast(ref_2.attlen as int2)) over (partition by ref_0.object_schema order by ref_2.attinhcount) as c2,
        pg_catalog.pg_create_physical_replication_slot(
          cast(ref_2.attname as name),
          cast(ref_2.atthasmissing as bool),
          cast(pg_catalog.yb_is_database_colocated() as bool)) as c3
      from
        information_schema.role_usage_grants as ref_0
            inner join information_schema.sql_parts as ref_1
            on ((false)
                and ((cast(null as int8) <= cast(null as int2))
                  and (89 <= 34)))
          right join pg_catalog.pg_attribute as ref_2
          on (ref_2.attnum = ref_2.atttypmod)
      where true
      limit 95) as subq_0
where true
limit 20;

The query seems to stay around:

\x
select *
  from pg_stat_activity
  where datname = 'yugabyte';
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------
datid            | 13281
datname          | yugabyte
pid              | 46014
usesysid         | 13280
usename          | yugabyte
application_name | ysqlsh
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 64372
backend_start    | 2022-01-26 14:21:25.051697+01
xact_start       | 2022-01-26 14:24:47.538128+01
query_start      | 2022-01-26 14:24:47.538128+01
state_change     | 2022-01-26 14:24:47.538181+01
wait_event_type  | LWLock
wait_event       | ReplicationSlotAllocationLock
state            | active
backend_xid      |
backend_xmin     | 799
query            | select
[...]
backend_type     | client backend

Cancelling with Ctrl-C has no effect.

sushantrmishra commented 2 years ago

Creating the slot using pg_create_physical_replication_slot is required for Streaming/replicating changes from a physical slot in PostgreSQL. In YugabyteDB, replication does not use streaming/replication protocol used in PostgreSQL, hence this function is not used and also not supported. Though we should return an more relevant error message in YugabyteDB when pg_create_physical_replication_slot is used.

def- commented 1 year ago

Similar situation with pg_catalog.pg_replication_origin_xact_reset(), might also be expected:

yugabyte=# \x
Expanded display is on.
yugabyte=# SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------
pid     | 628719
age     |
usename |
query   |
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------
pid     | 663159
age     | 02:53:51.284042
usename | yugabyte
query   | select                                                                                                          +
        |   34 as c0,                                                                                                     +
        |   subq_0.c0 as c1,                                                                                              +
        |   subq_0.c4 as c2,                                                                                              +
        |   subq_0.c4 as c3,                                                                                              +
        |   pg_catalog.pg_replication_origin_xact_reset() as c4,                                                          +
        |                                                                                                                 +
        |     pg_catalog.var_samp(                                                                                        +
        |       cast((select feature_col from public.feature_tab_real limit 1 offset 1)                                   +
        |          as float4)) over (partition by subq_0.c2,subq_0.c5 order by subq_0.c6) as c5,                          +
        |   (select a from public.pp_enumpart limit 1 offset 6)                                                           +
        |      as c6,                                                                                                     +
        |   subq_0.c6 as c7,                                                                                              +
        |   subq_0.c5 as c8,                                                                                              +
        |   subq_0.c1 as c9,                                                                                              +
        |   dbms_pipe.create_pipe(                                                                                        +
        |     cast(cast(nullif(                                                                                           +
        |         pg_catalog.max(                                                                                         +
        |           cast(utl_file.tmpdir() as text)) over (partition by subq_0.c6,subq_0.c6 order by subq_0.c1,subq_0.c0),+
        |       oracle.sessiontimezone()) as text) as text)) as c10,                                                      +
        |   (select pg_catalog.max(feature_col) from public.feature_tab_time)                                             +
        |      as c11                                                                                                     +
        | from                                                                                                            +
        |   (select                                                                                                       +
        |         ref_0.b as c0,                                                                                          +
        |         ref_0.b as c1,                                                                                          +
        |         ref_0.b as c2,                                                                                          +
        |         ref_0.a as c3,                                                                                          +
        |         cast(coalesce(ref_0.b,                                                                                  +
        |           ref_0.b) as int4) as c4,                                                                              +
        |         ref_0.b as c5,                                                                                          +
        |         ref_0.a as c6                                                                                           +
        |       from                                                                                                      +
        |

yugabyte=# select *
  from pg_stat_activity
  where datname = 'yugabyte';
-[ RECORD 1 ]----+-----------------------------------------------------------------------------------------------------------------
datid            | 13288
datname          | yugabyte
pid              | 663159
usesysid         | 13287
usename          | yugabyte
application_name | sqlsmith::dut
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 42266
backend_start    | 2022-12-28 20:26:41.171935+00
xact_start       | 2022-12-28 20:45:28.677106+00
query_start      | 2022-12-28 20:45:28.677177+00
state_change     | 2022-12-28 20:45:28.67742+00
wait_event_type  | LWLock
wait_event       | extension
state            | active
backend_xid      |
backend_xmin     | 28
query            | select                                                                                                          +
                 |   34 as c0,                                                                                                     +
                 |   subq_0.c0 as c1,                                                                                              +
                 |   subq_0.c4 as c2,                                                                                              +
                 |   subq_0.c4 as c3,                                                                                              +
                 |   pg_catalog.pg_replication_origin_xact_reset() as c4,                                                          +
                 |                                                                                                                 +
                 |     pg_catalog.var_samp(                                                                                        +
                 |       cast((select feature_col from public.feature_tab_real limit 1 offset 1)                                   +
                 |          as float4)) over (partition by subq_0.c2,subq_0.c5 order by subq_0.c6) as c5,                          +
                 |   (select a from public.pp_enumpart limit 1 offset 6)                                                           +
                 |      as c6,                                                                                                     +
                 |   subq_0.c6 as c7,                                                                                              +
                 |   subq_0.c5 as c8,                                                                                              +
                 |   subq_0.c1 as c9,                                                                                              +
                 |   dbms_pipe.create_pipe(                                                                                        +
                 |     cast(cast(nullif(                                                                                           +
                 |         pg_catalog.max(                                                                                         +
                 |           cast(utl_file.tmpdir() as text)) over (partition by subq_0.c6,subq_0.c6 order by subq_0.c1,subq_0.c0),+
                 |       oracle.sessiontimezone()) as text) as text)) as c10,                                                      +
                 |   (select pg_catalog.max(feature_col) from public.feature_tab_time)                                             +
                 |      as c11                                                                                                     +
                 | from                                                                                                            +
                 |   (select                                                                                                       +
                 |         ref_0.b as c0,                                                                                          +
                 |         ref_0.b as c1,                                                                                          +
                 |         ref_0.b as c2,                                                                                          +
                 |         ref_0.a as c3,                                                                                          +
                 |         cast(coalesce(ref_0.b,                                                                                  +
                 |           ref_0.b) as int4) as c4,                                                                              +
                 |         ref_0.b as c5,                                                                                          +
                 |         ref_0.a as c6                                                                                           +
                 |       from                                                                                                      +
                 |
backend_type     | client backend
-[ RECORD 2 ]----+-----------------------------------------------------------------------------------------------------------------
datid            | 13288
datname          | yugabyte
pid              | 673357
usesysid         | 13287
usename          | yugabyte
application_name | ysqlsh
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 42320
backend_start    | 2022-12-28 23:37:47.377639+00
xact_start       | 2022-12-28 23:39:22.896183+00
query_start      | 2022-12-28 23:39:22.896183+00
state_change     | 2022-12-28 23:39:22.896257+00
wait_event_type  |
wait_event       |
state            | active
backend_xid      |
backend_xmin     | 28
query            | select *                                                                                                        +
                 |   from pg_stat_activity                                                                                         +
                 |   where datname = 'yugabyte';
backend_type     | client backend

The processes stay around on a yb-ctl stop:

[deen@devp yugabyte-db]$ ps aux|grep postgres
deen      628709  0.1  0.2 152892 47284 ?        Ss   Dec28   1:20 postgres: logger
deen      663159  0.4  0.7 448152 116352 ?       Ssl  Dec28   3:01 postgres: yugabyte yugabyte 127.0.0.1(42266) SELECT
deen      673500  0.0  0.4 366156 79008 ?        Ssl  Dec28   0:06 postgres: yugabyte yugabyte 127.0.0.1(42324) SELECT
deen      673710  0.0  0.4 371276 73728 ?        Ssl  Dec28   0:05 postgres: yugabyte yugabyte 127.0.0.1(42332) INSERT

This even hung yb-ctl destroy:

Destroying cluster.
2022-12-28 23:45:08,922 INFO: Stopping Server master-1 PID=628089
2022-12-28 23:45:08,922 INFO: Waiting for Server master-1 PID=628089 to stop...
2022-12-28 23:45:09,437 INFO: Stopping Server master-2 PID=628092
2022-12-28 23:45:09,437 INFO: Waiting for Server master-2 PID=628092 to stop...
2022-12-28 23:45:09,953 INFO: Stopping Server master-3 PID=628095
2022-12-28 23:45:09,953 INFO: Waiting for Server master-3 PID=628095 to stop...
2022-12-28 23:45:10,468 INFO: Stopping Server tserver-1 PID=628098
2022-12-28 23:45:10,469 INFO: Waiting for Server tserver-1 PID=628098 to stop...
2022-12-28 23:45:10,969 INFO: Stopping Postmaster for tserver-1 PID=628638
2022-12-28 23:45:10,969 INFO: Waiting for Postmaster for tserver-1 PID=628638 to stop...
2022-12-28 23:45:11,970 INFO: Waiting for Postmaster for tserver-1 PID=628638 to stop...
2022-12-28 23:45:12,971 INFO: Waiting for Postmaster for tserver-1 PID=628638 to stop...

It only reacts to a kill -9

[deen@devp ~]$ ps aux|grep 628638
deen      628638  0.0  0.4 278488 67084 ?        S    Dec28   0:01 /home/deen/code/yugabyte-db/build/release-clang15-linuxbrew-dynamic/postgres/bin/postgres -D /home/deen/yugabyte-data/node-1/disk-1/pg_data -p 5433 -h 127.0.0.1 -k /tmp/.yb.127.0.0.1:5433 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/home/deen/yugabyte-data/node-1/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=devp:9000 -c yb_pg_metrics.port=13000 -c config_file=/home/deen/yugabyte-data/node-1/disk-1/pg_data/ysql_pg.conf -c hba_file=/home/deen/yugabyte-data/node-1/disk-1/pg_data/ysql_hba.conf