shamanthchandra-yb opened this issue 7 months ago
cc: @lingamsandeep @rthallamko3
@shamanthchandra-yb can you also post the result of the upgrade to 2.20.3.0 here, which shows that this isn't a regression introduced by 2024.1?
This is a new stress test designed to stress the YSQL upgrade workflow. Two runs were also performed upgrading to 2.20.3.0, and both showed a similar problem: one or more nodes became unreachable. This confirms that the issue is a pre-existing one and not a regression introduced by 2024.1, so it should not be a 2024.1 release blocker.
The root cause of this issue is that several YSQL upgrade migration scripts contain DDL statements that increment the catalog version. Each such DDL statement triggers a catalog cache refresh on all existing connections across the entire cluster. This resulted in too many catalog cache refreshes during the upgrade process, which overwhelmed the master.
Although the tserver catalog cache is already enabled by default in 2024.1, this test created 33 extra user databases, bringing the total number of databases to 38. Tserver cache entries cannot be shared across databases, so we still see a significant number of catalog cache refreshes that have to go to the master leader node.
We are doing additional testing to see whether entering per-database catalog version mode earlier can help to mitigate the issue.
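To make the refresh activity easier to quantify, something like the following can be used to poll the catalog version while the migrations run. This is only a sketch: it assumes the `pg_yb_catalog_version` system catalog (with `db_oid` and `current_version` columns) as found in recent YSQL builds, a PostgreSQL JDBC driver on the classpath, and a placeholder node address.

```java
// Minimal sketch: periodically poll pg_yb_catalog_version to observe how many
// version bumps (and hence cluster-wide catalog cache refreshes) the upgrade
// migrations generate. Connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CatalogVersionProbe {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://127.0.0.1:5433/yugabyte"; // placeholder node
        try (Connection conn = DriverManager.getConnection(url, "yugabyte", "");
             Statement stmt = conn.createStatement()) {
            while (true) {
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT db_oid, current_version FROM pg_yb_catalog_version ORDER BY db_oid")) {
                    while (rs.next()) {
                        System.out.println("db_oid=" + rs.getLong(1)
                            + " current_version=" + rs.getLong(2));
                    }
                }
                Thread.sleep(10_000); // poll every 10 seconds
            }
        }
    }
}
```

Each jump in `current_version` corresponds to a catalog-version-bumping DDL, which forces the existing connections to refresh their catalog caches.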
Jira Link: DB-10897
Description
Please find the Slack thread in the JIRA description.
This appears to reproduce fairly consistently.
I have a 7-node universe with an RF3 setup, and I am trying to stress the upgrade feature. The upgrade was from:
2.18.4.0-b52
to 2024.1.0.0-b53
Testcase:
Workload Example:
Read and write code can be seen here: https://github.com/yugabyte/yb-stress-test/blob/master/tools/sample-app/src/main/java/com/yugabyte/sample/apps/SqlDataLoad.java
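For context, below is a greatly simplified sketch of the kind of read/write loop that workload runs; the actual logic lives in the linked SqlDataLoad.java, and the table name, column layout, and connection details here are purely illustrative.

```java
// Illustrative sketch only, not the actual SqlDataLoad implementation:
// repeatedly upsert a row and read it back over a JDBC connection.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlDataLoadSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://127.0.0.1:5433/yugabyte"; // placeholder node
        try (Connection conn = DriverManager.getConnection(url, "yugabyte", "")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS stress_test (k TEXT PRIMARY KEY, v TEXT)");
            }
            try (PreparedStatement write = conn.prepareStatement(
                     "INSERT INTO stress_test (k, v) VALUES (?, ?) "
                   + "ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v");
                 PreparedStatement read = conn.prepareStatement(
                     "SELECT v FROM stress_test WHERE k = ?")) {
                for (long i = 0; ; i++) {
                    String key = "key-" + (i % 100_000);
                    write.setString(1, key);
                    write.setString(2, "value-" + i);
                    write.executeUpdate();
                    read.setString(1, key);
                    try (ResultSet rs = read.executeQuery()) {
                        rs.next();
                    }
                }
            }
        }
    }
}
```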
Observation:
After the upgrade completes, or towards the end of the upgrade, total ops drop to nearly 0 and stay there for a very long time. Sometimes, without any manual intervention, throughput recovers to the pre-upgrade level on its own.
In the sample apps, these logs were seen for nodes where a connection was not possible:
For debugging purposes, we enhanced the sample apps' metric tracker to report the connection host versus the number of connections made to it.
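Conceptually, the enhancement is just a per-host counter bumped on every successful connection; a minimal sketch (class and method names are illustrative, not the actual patch) looks like this:

```java
// Sketch of a per-host connection counter for the metric tracker.
// Names are illustrative; the real change lives in the sample apps.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ConnectionCounter {
    private final Map<String, LongAdder> connectionsByHost = new ConcurrentHashMap<>();

    // Called each time a new JDBC connection is successfully opened to a host.
    public void recordConnection(String host) {
        connectionsByHost.computeIfAbsent(host, h -> new LongAdder()).increment();
    }

    // Dump cumulative counts, e.g. once a minute alongside the ops metrics.
    public void report() {
        connectionsByHost.forEach((host, count) ->
            System.out.println(host + " -> " + count.sum() + " connections"));
    }
}
```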
Run 1:
i.e. this is the delta over ~41 minutes for each node during the affected period:
This is extremely low. In fact, like 172.151.18.201, all other hosts stay roughly constant with hardly any new connections, except for a few being added once in a while. So the numbers above are the cumulative additions over ~40 minutes.
For comparison, consider this workload's performance before the upgrade at a random time; the data below covers just 1 minute. In just ~1 minute, there are approximately 2,700+ connections on each node.
Run 2:
We also observed multiple hosts going into a state where no new connections were established.
In ~8 hours:
3 nodes didn't create any new connections. In this run, it never recovered to the original state.
Please go through the Slack thread for more information, as the RCA is still being discussed between different teams.
Issue Type
kind/bug