shamanthchandra-yb opened this issue 7 months ago
cc: @lingamsandeep @rthallamko3
@shamanthchandra-yb can you also post the result of the upgrade to 2.20.3.0 here, which shows that this isn't a regression introduced by 2024.1?
This is a new stress test designed to stress the YSQL upgrade workflow. Two runs were also performed upgrading to 2.20.3.0, and both showed a similar problem: one or more nodes became unreachable. This confirms that the issue is a pre-existing one and not a regression introduced by 2024.1, so it should not be a 2024.1 release blocker.
The root cause of this issue is that several YSQL upgrade migration scripts contain DDL statements that increment the catalog version. Each such DDL statement triggers a catalog cache refresh on all existing connections across the entire cluster. This resulted in too many catalog cache refreshes during the upgrade process, which overwhelmed the master.
Although the tserver catalog cache is already enabled by default in 2024.1, this test created 33 extra user databases, bringing the total number of databases to 38. Tserver cache entries cannot be shared across databases, so we still see a significant number of catalog cache refreshes that have to go to the master leader node.
We are doing additional testing to see whether entering per-database catalog version mode earlier can help to mitigate the issue.
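To make the refresh activity easier to quantify, something like the following can be used to poll the catalog version while the migrations run. This is only a sketch: it assumes the `pg_yb_catalog_version` system catalog (with `db_oid` and `current_version` columns) as found in recent YSQL builds, a PostgreSQL JDBC driver on the classpath, and a placeholder node address.

```java
// Minimal sketch: periodically poll pg_yb_catalog_version to observe how many
// version bumps (and hence cluster-wide catalog cache refreshes) the upgrade
// migrations generate. Connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CatalogVersionProbe {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://127.0.0.1:5433/yugabyte"; // placeholder node
        try (Connection conn = DriverManager.getConnection(url, "yugabyte", "");
             Statement stmt = conn.createStatement()) {
            while (true) {
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT db_oid, current_version FROM pg_yb_catalog_version ORDER BY db_oid")) {
                    while (rs.next()) {
                        System.out.println("db_oid=" + rs.getLong(1)
                            + " current_version=" + rs.getLong(2));
                    }
                }
                Thread.sleep(10_000); // poll every 10 seconds
            }
        }
    }
}
```

Each jump in `current_version` corresponds to a catalog-version-bumping DDL, which forces the existing connections to refresh their catalog caches.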
Jira Link: DB-10897
Description
Please find the Slack thread in the JIRA description.
This appears to reproduce fairly consistently.
I have a 7-node universe with an RF3 setup, and I am trying to stress the upgrade feature. The upgrade was from:
2.18.4.0-b52
to 2024.1.0.0-b53
Testcase:
Workload Example:
Read and write code can be seen here: https://github.com/yugabyte/yb-stress-test/blob/master/tools/sample-app/src/main/java/com/yugabyte/sample/apps/SqlDataLoad.java
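For context, below is a greatly simplified sketch of the kind of read/write loop that workload runs; the actual logic lives in the linked SqlDataLoad.java, and the table name, column layout, and connection details here are purely illustrative.

```java
// Illustrative sketch only, not the actual SqlDataLoad implementation:
// repeatedly upsert a row and read it back over a JDBC connection.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlDataLoadSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://127.0.0.1:5433/yugabyte"; // placeholder node
        try (Connection conn = DriverManager.getConnection(url, "yugabyte", "")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS stress_test (k TEXT PRIMARY KEY, v TEXT)");
            }
            try (PreparedStatement write = conn.prepareStatement(
                     "INSERT INTO stress_test (k, v) VALUES (?, ?) "
                   + "ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v");
                 PreparedStatement read = conn.prepareStatement(
                     "SELECT v FROM stress_test WHERE k = ?")) {
                for (long i = 0; ; i++) {
                    String key = "key-" + (i % 100_000);
                    write.setString(1, key);
                    write.setString(2, "value-" + i);
                    write.executeUpdate();
                    read.setString(1, key);
                    try (ResultSet rs = read.executeQuery()) {
                        rs.next();
                    }
                }
            }
        }
    }
}
```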
Observation:
After the upgrade completes, or towards the end of the upgrade, total ops drop to nearly 0 and stay there for a very long time. Sometimes, without any manual intervention, throughput recovers to the pre-upgrade level on its own.
In the sample apps, these logs were seen for nodes where a connection was not possible:
For debugging purposes, we enhanced the sample apps' metric tracker to report the connection host versus the number of connections made to it.
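Conceptually, the enhancement is just a per-host counter bumped on every successful connection; a minimal sketch (class and method names are illustrative, not the actual patch) looks like this:

```java
// Sketch of a per-host connection counter for the metric tracker.
// Names are illustrative; the real change lives in the sample apps.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ConnectionCounter {
    private final Map<String, LongAdder> connectionsByHost = new ConcurrentHashMap<>();

    // Called each time a new JDBC connection is successfully opened to a host.
    public void recordConnection(String host) {
        connectionsByHost.computeIfAbsent(host, h -> new LongAdder()).increment();
    }

    // Dump cumulative counts, e.g. once a minute alongside the ops metrics.
    public void report() {
        connectionsByHost.forEach((host, count) ->
            System.out.println(host + " -> " + count.sum() + " connections"));
    }
}
```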
Run 1:
i.e. this is the delta over ~41 minutes for each node during the affected period:
This is extremely low. In fact, like 172.151.18.201, all other hosts stay roughly constant with hardly any new connections, except for a few being added once in a while. So the numbers above are the cumulative additions over ~40 minutes.
For comparison, consider this workload's performance before the upgrade at a random time; the data below covers just 1 minute. In just ~1 minute, there are approximately 2,700+ connections on each node.
Run 2:
We also observed multiple hosts going into a state where no new connections were established.
In ~8 hours:
3 nodes didn't create any new connections. In this run, it never recovered to the original state.
Please go through the Slack thread for more information, as the RCA is still being discussed between different teams.
Issue Type
kind/bug