yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL][LST] ERROR: database "..." is being accessed by other users #13495

Open def- opened 2 years ago

def- commented 2 years ago

Jira Link: DB-3090

Description

$ cd ~/code/yugabyte-db
$ git checkout 9551e72e665bbb3650463e91d3c4a8ddb072df43
$ arc patch D18465
$ ./yb_build.sh release
$ bin/yb-ctl --replication_factor 3 create --tserver_flags=yb_enable_read_committed_isolation=true,ysql_num_shards_per_tserver=1,enable_stream_compression=true,stream_compression_algo=1,yb_num_shards_per_tserver=1 --master_flags=yb_enable_read_committed_isolation=true,enable_stream_compression=true,stream_compression_algo=1,enable_automatic_tablet_splitting=true,tablet_split_low_phase_shard_count_per_node=40,tablet_split_high_phase_shard_count_per_node=50
$ cd ~/code/yb-long-system-test
$ git checkout 2027af8b
$ while true; do python3.9 ./long_system_test.py --nodes=127.0.0.1:5433,127.0.0.2:5433,127.0.0.3:5433 --threads=10 --runtime=60 --max-columns=20 --complexity=full --seed=370648; done

After a while the database can't be deleted anymore:

  File "/home/deen/code/yb-long-system-test/lst/dbobject.py", line 17, in execute
    cur.execute(query)
psycopg2.errors.ObjectInUse: database "db_lst_370648" is being accessed by other users
DETAIL:  There are 2 other sessions using the database.
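
For context, the failing step boils down to a DROP DATABASE issued through psycopg2. A minimal sketch of that call (not the actual LST code; the connection parameters and autocommit handling here are assumptions) looks like this:

# Minimal sketch of the failing drop step; host/port/user are assumptions for a local yb-ctl cluster.
# DROP DATABASE cannot run inside a transaction block, so autocommit is enabled.
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", port=5433, user="yugabyte", dbname="yugabyte")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("DROP DATABASE db_lst_370648")  # fails with ObjectInUse as shown above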

I checked manually via ysqlsh and there are no other sessions connected to it:

[deen@devp yugabyte-db]$ bin/ysqlsh
ysqlsh (11.2-YB-2.15.2.0-b0)
Type "help" for help.

yugabyte=# SELECT * FROM pg_stat_activity;
-[ RECORD 1 ]----+--------------------------------
 datid            | 13288
 datname          | yugabyte
 pid              | 616833
 usesysid         | 13287
 usename          | yugabyte
 application_name | ysqlsh
 client_addr      | 127.0.0.1
 client_hostname  |
 client_port      | 33722
 backend_start    | 2022-08-01 08:39:54.590703+00
 xact_start       | 2022-08-01 08:40:10.907235+00
 query_start      | 2022-08-01 08:40:10.907235+00
 state_change     | 2022-08-01 08:40:10.907349+00
 wait_event_type  |
 wait_event       |
 state            | active
 backend_xid      |
 backend_xmin     | 3
 query            | SELECT * FROM pg_stat_activity;
 backend_type     | client backend
-[ RECORD 2 ]----+--------------------------------
 datid            |
 datname          |
 pid              | 303346
 usesysid         |
 usename          |
 application_name |
 client_addr      |
 client_hostname  |
 client_port      |
 backend_start    | 2022-07-29 22:37:50.070651+00
 xact_start       |
 query_start      |
 state_change     |
 wait_event_type  | Activity
 wait_event       | CheckpointerMain
 state            |
 backend_xid      |
 backend_xmin     |
 query            |
 backend_type     | checkpointer
(2 rows)

yugabyte=# drop database db_lst_370648
yugabyte-# ;
ERROR:  database "db_lst_370648" is being accessed by other users
DETAIL:  There are 2 other sessions using the database.

So it seems that failed sessions are not properly cleaned up and can linger, blocking deletion of YSQL databases. I'll provide the yugabyte-data directory in case it helps. It is available from within the Yugabyte organization: https://drive.google.com/file/d/1YwryMJEZ0Y5Vq2ltLPOf47rdZZ0-kvLA/view?usp=sharing
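
To see which backends the server still associates with the database, pg_stat_activity can be filtered by datname, and pg_terminate_backend() can be tried as a best-effort workaround. Below is a sketch of that check (connection parameters are assumptions; in this case the query comes back empty even though DROP DATABASE keeps reporting 2 other sessions):

# Diagnostic/cleanup sketch; assumed connection parameters, workaround only.
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", port=5433, user="yugabyte", dbname="yugabyte")
conn.autocommit = True
with conn.cursor() as cur:
    # List sessions the server reports as connected to the stuck database.
    cur.execute(
        "SELECT pid, state, backend_type, backend_start "
        "FROM pg_stat_activity WHERE datname = %s",
        ("db_lst_370648",))
    sessions = cur.fetchall()
    print(sessions)
    # Best-effort termination of anything that does show up.
    for row in sessions:
        cur.execute("SELECT pg_terminate_backend(%s)", (row[0],))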

tedyu commented 2 years ago

Currently errdetail_busy_db() only shows the number of other sessions using the database but doesn't give any detail about them. Looking at CountOtherDBBackends(), it seems proc->pid is available when nbackends is incremented. We could enhance the error message with the PIDs so that it becomes clear what needs to be cleaned up.

tedyu commented 2 years ago

Dennis and I ran some experiments over the past week.

Commit 2edca4c440388d04a52767d565fbb75c387fdfec was reverted in the LST runs.

Since then, Dennis hasn't seen DB-3090 pop up in the LST runs.

Looking at D18181, for SysCatalogTable::ReadYsqlDBCatalogVersionImpl(), one difference the commit introduced (compared to the previous logic) is that there is a while loop around the call to iter->HasNext(). I haven't spotted any semantic change in the body of the while loop.

If the while loop, in the previous logic (when versions is null), is supposed to return one row, we can increment a counter inside the loop and add an assertion outside the loop that the counter equals one.

def- commented 1 year ago

I also saw this recently in itest-system; it happened sporadically on master:

https://jenkins.dev.yugabyte.com/job/itest-system/2064/testReport/
testbackupwithisolationlevel-aws-rf3: Start
    (     0.854s) User Login : Success
    (     0.595s) Refresh YB Version : Success
    (   181.452s) Setup Provider : Success
    (     0.137s) Updating Health Check Interval to 60000 sec : Success
    (   423.290s) Create universe jenk-is2064-bcfd2119f1-20221111-092802 : Success
    (    10.808s) Updating Health Check Interval to 60000 sec : Success
    (   422.865s) Create universe jenk-is2064-bcfd2119f1-20221111-092802-2 : Success
    (    40.287s) Start sample workloads : Success
    (     0.072s) Check and stop workloads : Success
    (   122.114s) Create Backups : Success
    (     5.813s) Create Backups : >>> Integration Test Failed <<<
Failed running ysqlsh command with error 'b'ERROR:  database "postgres" is being accessed by other users\nDETAIL:  There are 12 other sessions using the database.\n''
    (    21.113s) Saved server log files and keys at /share/jenkins/workspace/itest-system/logs/2.17.1.0_testbackupwithisolationlevel-aws-rf3_20221111_223055 : Success
    (    98.150s) Destroy universe : Success
    (    45.959s) Saved server log files and keys at /share/jenkins/workspace/itest-system/logs/2.17.1.0_testbackupwithisolationlevel-aws-rf3_20221111_223254 : Success
    (   171.316s) Destroy universe : Success
    (     0.339s) Check and stop workloads : Success
testbackupwithisolationlevel-aws-rf3: End

bllewell commented 1 year ago

Issue #15675 provides two test cases that reliably reproduce this issue.