Open def- opened 2 years ago
Currently errdetail_busy_db() only shows the number of other sessions using the database but doesn't give any detail about them. Looking at CountOtherDBBackends(), it seems proc->pid is available when nbackends is incremented. We can enhance the error message with the PIDs so that it becomes clear what to clean up.
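A minimal standalone sketch of that idea, not the actual PostgreSQL code: BackendEntry, OtherBackendPids(), and BusyDbDetail() are hypothetical names, and the detail wording is only an illustration. It collects each PID at the point where the counter would be incremented and appends the list to the detail line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-backend fields used here
// (the issue only mentions proc->pid; db_oid is an assumption).
struct BackendEntry {
  int pid;          // backend process ID
  unsigned db_oid;  // OID of the database the backend is connected to
};

// Collects the PIDs of all other backends connected to db_oid.
std::vector<int> OtherBackendPids(const std::vector<BackendEntry>& procs,
                                  unsigned db_oid, int my_pid) {
  std::vector<int> pids;
  for (const BackendEntry& proc : procs) {
    if (proc.db_oid != db_oid || proc.pid == my_pid) continue;
    pids.push_back(proc.pid);  // same point where the counter would be bumped
  }
  return pids;
}

// Builds a detail line that names the sessions instead of only counting them.
std::string BusyDbDetail(const std::vector<int>& pids) {
  std::string detail = "There are " + std::to_string(pids.size()) +
                       " other sessions using the database (PIDs:";
  for (std::size_t i = 0; i < pids.size(); ++i) {
    detail += (i == 0 ? " " : ", ") + std::to_string(pids[i]);
  }
  return detail + ").";
}
```

With the PIDs in the message, whoever hits the error can look up exactly those backends instead of guessing which sessions are still attached.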
Dennis and I ran some experiments over the past week.
We reverted commit 2edca4c440388d04a52767d565fbb75c387fdfec for the LST runs.
Since then, Dennis hasn't seen DB-3090 come up in LST runs.
Looking at D18181, for SysCatalogTable::ReadYsqlDBCatalogVersionImpl(), one difference the commit introduced (compared to the previous logic) is that there is a while loop around the call to iter->HasNext(). I haven't spotted any semantic change in the body of the while loop.
If the while loop, in the previous logic (when versions is null), is supposed to return one row, we can increment a counter inside the loop and add an assertion outside the loop that the counter equals one.
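A minimal standalone sketch of the counter idea, assuming a simplified iterator: RowIterator and ReadSingleVersion() are hypothetical stand-ins, and only HasNext() is taken from the code discussed above. The function returns the number of rows the loop visited so the caller can assert that it equals one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the catalog row iterator.
class RowIterator {
 public:
  explicit RowIterator(std::vector<uint64_t> versions)
      : versions_(std::move(versions)) {}
  bool HasNext() const { return pos_ < versions_.size(); }
  uint64_t Next() { return versions_[pos_++]; }

 private:
  std::vector<uint64_t> versions_;
  std::size_t pos_ = 0;
};

// Reads the version and returns how many rows the loop visited.
std::size_t ReadSingleVersion(RowIterator* iter, uint64_t* version) {
  std::size_t rows_seen = 0;
  while (iter->HasNext()) {
    *version = iter->Next();
    ++rows_seen;  // counter incremented inside the loop
  }
  return rows_seen;  // caller asserts this equals one
}
```

If the assertion fires with a count of zero or more than one, that would show the loop is not behaving like the previous single-row read.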
I also saw this recently in itest-system; it happened sporadically on master:
https://jenkins.dev.yugabyte.com/job/itest-system/2064/testReport/
testbackupwithisolationlevel-aws-rf3: Start
( 0.854s) User Login : Success
( 0.595s) Refresh YB Version : Success
( 181.452s) Setup Provider : Success
( 0.137s) Updating Health Check Interval to 60000 sec : Success
( 423.290s) Create universe jenk-is2064-bcfd2119f1-20221111-092802 : Success
( 10.808s) Updating Health Check Interval to 60000 sec : Success
( 422.865s) Create universe jenk-is2064-bcfd2119f1-20221111-092802-2 : Success
( 40.287s) Start sample workloads : Success
( 0.072s) Check and stop workloads : Success
( 122.114s) Create Backups : Success
( 5.813s) Create Backups : >>> Integration Test Failed <<<
Failed running ysqlsh command with error 'b'ERROR: database "postgres" is being accessed by other users\nDETAIL: There are 12 other sessions using the database.\n''
( 21.113s) Saved server log files and keys at /share/jenkins/workspace/itest-system/logs/2.17.1.0_testbackupwithisolationlevel-aws-rf3_20221111_223055 : Success
( 98.150s) Destroy universe : Success
( 45.959s) Saved server log files and keys at /share/jenkins/workspace/itest-system/logs/2.17.1.0_testbackupwithisolationlevel-aws-rf3_20221111_223254 : Success
( 171.316s) Destroy universe : Success
( 0.339s) Check and stop workloads : Success
testbackupwithisolationlevel-aws-rf3: End
Issue 15675 provides two test cases that reliably reproduce this issue.
Jira Link: DB-3090
Description
After a while the database can't be deleted anymore:
I checked manually via ysqlsh and there are no other sessions running connected to it:
So it seems like failed sessions are not properly cleaned up and might stay around, blocking deletion of YSQL databases. I'll provide the yugabyte-data directory in case it helps. Available from within Yugabyte organization: https://drive.google.com/file/d/1YwryMJEZ0Y5Vq2ltLPOf47rdZZ0-kvLA/view?usp=sharing