whatyouhide / xandra

Fast, simple, and robust Cassandra/ScyllaDB driver for Elixir.
ISC License
406 stars 54 forks source link

Periodic refresh of a DOWN node causes routing to downed node #302

Closed harunzengin closed 1 year ago

harunzengin commented 1 year ago

Currently, when a node is down and we do a periodic topology refresh, we consider the node as UP and perform a healtcheck, and before we conclude the healthcheck, the new connection pool for the downed node is already being routed some queries, which causes noise in the logs due to connection failures. We should be waiting for the healtcheck to be completed before we route queries. See https://github.com/lexhide/xandra/issues/301

whatyouhide commented 1 year ago

@harunzengin I'm not sure how we can do that, because it's a chicken and egg problem, no? In order to perform the healthcheck, we need to start the pool, so that connections register in the registry. But if the node is down, should we start the pool at all? Maybe we can start the pool and just not route queries to it... Thoughts?

harunzengin commented 1 year ago

Yeah, it was our idea as well to maybe not route anything before the healthcheck is completed. And since the healthcheck is basically to wait a fixed amount of time n before looking at the Registry whether the process registered itself, we could maybe not accept any queries in the load balancing strategy for t>n?

whatyouhide commented 1 year ago

Maybe we should just mark pools as ready or not in the cluster, and then only pick pools that are up? Here.

Otherwise, we can think of marking the host as "up" in the load-balancing plan only once it passes the health check?

harunzengin commented 1 year ago

So in addition to :up and :down in the state, having a :ready, :not_ready state you mean?

Shall we introduce a new message {:healtcheck, :pass, host} and {:healtcheck, :fail, host} from control_connection to cluster then?

whatyouhide commented 1 year ago

@harunzengin I don't think we need a new state. After all, a host is "up" when we can route queries to it, right? Could we consider the host as added and "down" until we get the healthcheck message?

harunzengin commented 1 year ago

Currently, the control connection sends a :host_up message to the cluster whenever it thinks that a node is up, this can be from cassandra gossip, or from the system peers table. In either case, the cluster process starts the connection and sends the control connection a {:healthcheck, host} message. If the control connection thinks the node is down, it sends a :node_down message to the cluster.

Now, we need to change this so that the control connection sends sth. like a :host_init message instead of :host_up. The cluster initializes the connection, then sends a {:healthcheck, host} message to the control connection. The control connection sends a :host_up or :host_down message afterwards. Thoughts? @whatyouhide

whatyouhide commented 1 year ago

Yes exactly, that's what I was thinking of: using a new :host_reported_up (I like this name in particular) message. This would tell us that a host is in the peers table or reported as up, but then we'd only send :host_up or :host_down based on the healthcheck. Does this make sense?

harunzengin commented 1 year ago

Yeap, that makes sense. I can start working on this then 👍

whatyouhide commented 1 year ago

Awesome, thanks @harunzengin! 🙏