wesql / wescale

WeScale is a Modern MySQL proxy that supports read-write-split, read-after-write-consistency, load balancing and OnlineDDL.
Apache License 2.0

Feature Request: when the primary tablet is unreachable, instead of throwing an error immediately, pause the current SQL execution and wait for a timeout #69

Closed weicao closed 1 year ago

weicao commented 1 year ago

Feature Description

After I kill the wesql-server leader node, wesql-scale immediately returns an error for subsequent SQL queries until a new leader is elected and the wesql-server cluster recovers.

kill the leader node

$ kubectl delete pod my-wesqlscale-cluster-mysql-0
pod "my-wesqlscale-cluster-mysql-0" deleted

wesql-scale returns error like this

MySQL [information_schema]> select * from information_schema.wesql_cluster_global;
ERROR 1105 (HY000): target: _vt.0.primary: primary is not serving, there is a reparent operation in progress

This behavior is not what users expect. Applications usually rely on database connection pooling: when any error occurs on a connection, that connection is terminated and removed from the pool, and a new connection is created to replace it. If the wesql-server cluster can recover within a short period of time, these error-handling operations impose unnecessary cost.

I would like SQL queries to be paused for a while and answered as usual once the wesql-server cluster recovers, like this:

MySQL [information_schema]> select * from information_schema.wesql_cluster_global;

... waiting for several seconds (of course the shorter the better) ...

+-----------+------------------------------------------------------------------+-------------+------------+----------+-----------+------------+-----------------+----------------+---------------+------------+--------------+
| SERVER_ID | IP_PORT                                                          | MATCH_INDEX | NEXT_INDEX | ROLE     | HAS_VOTED | FORCE_SYNC | ELECTION_WEIGHT | LEARNER_SOURCE | APPLIED_INDEX | PIPELINING | SEND_APPLIED |
+-----------+------------------------------------------------------------------+-------------+------------+----------+-----------+------------+-----------------+----------------+---------------+------------+--------------+
|         1 | my-wesqlscale-cluster-mysql-0.my-wesqlscale-cluster-mysql-headle |           0 |         78 | Follower | No        | No         |               5 |              0 |             0 | Yes        | No           |
|         2 | my-wesqlscale-cluster-mysql-1.my-wesqlscale-cluster-mysql-headle |          78 |          0 | Leader   | Yes       | No         |               5 |              0 |            77 | No         | No           |
|         3 | my-wesqlscale-cluster-mysql-2.my-wesqlscale-cluster-mysql-headle |          78 |         79 | Follower | Yes       | No         |               5 |              0 |            77 | Yes        | No           |
+-----------+------------------------------------------------------------------+-------------+------------+----------+-----------+------------+-----------------+----------------+---------------+------------+--------------+

Use Case(s)

No response

earayu commented 1 year ago

see also: https://vitess.io/docs/16.0/reference/features/vtgate-buffering/

earayu commented 1 year ago

should fix testcase: _TestGatewayBufferingWhenPrimarySwitchesServingState

earayu commented 1 year ago

_TestGatewayBufferingWhileReparenting

earayu commented 1 year ago

Next, I will implement this feature.

earayu commented 1 year ago

see also: https://vitess.io/docs/16.0/reference/features/vtgate-buffering/

see also: https://github.com/vitessio/vitess/issues/8462

The Vitess implementation is quite complicated because of its sharding architecture, which leads to a lot of redundant code. In my opinion, it may be over-designed, and it lacks support for unplanned failover. I will open a new issue about this feature.