harunzengin opened 1 week ago
I dug deeper into this and it seems like there are 2 problems:

1. When all nodes go down, we successfully mark all the hosts as `:down`. We continuously try to establish a new control connection to any of the nodes. After we have established a new control connection, we only send `Xandra.Cluster.Pool` the `:discovered_hosts`, which provides information about the topology of the cluster but no `:up` or `:down` information. This leads to us being connected to one of the nodes with the control connection while all of the nodes stay marked as `:down` in the `Xandra.Cluster.Pool` state. This is not difficult to fix: emitting a `:host_up` event when we establish a new control connection should fix the problem (see the sketch after this list).

2. We rely on Cassandra gossip to inform us about host `:up` or `:down` events. When the Cassandra nodes start up close together in time, it seems that gossip doesn't inform us (or the message somehow gets lost). The periodic topology refresh tells us about the cluster topology periodically, but there's no `:host_up` or `:host_down` information in it, so we keep those hosts as `:down` in `Xandra.Cluster.Pool` forever (unless there are individual `:host_up` or `:host_down` events). Not sure how to fix...
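For the first problem, here's a minimal sketch of what I mean. The module and function names are made up for illustration (they are not the actual Xandra internals); the point is just that the control connection could also report the host it connected to as up, instead of only sending the discovered topology:

```elixir
# Hypothetical sketch, not the actual Xandra code: when the control connection
# manages to (re)connect to a node, it reports that node as up in addition to
# sending the discovered topology to the pool.
defmodule ControlConnectionSketch do
  # `pool` is the Xandra.Cluster.Pool process, `host` is the node we just
  # connected to, `peers` is the topology we discovered from that node.
  def report_connected(pool, host, peers) do
    # Today we only send the topology information...
    send(pool, {:discovered_hosts, peers})

    # ...the idea is to also emit a :host_up for the node behind the control
    # connection, so the pool doesn't keep every host marked :down after a
    # full-cluster outage.
    send(pool, {:host_up, host})
  end
end
```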
I think to solve the second problem, we might try to handle the `:discovered_peers` event differently when all of our nodes are marked as `:down`. In that case, we should try to start new pools via `Xandra.Cluster.Pool.maybe_start_pools/1`. If the nodes are really down, we already have the logic where we fail to establish a connection and mark the node as `:down`; however, if they are actually up, we'll mark them as `:connected`. Rough sketch below. Thoughts @whatyouhide ?
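Sketch only, meant as a clause inside `Xandra.Cluster.Pool`; the `:peers` and `:status` fields are placeholders and probably don't match the real state layout:

```elixir
# Sketch: handle :discovered_peers differently when every known host is :down.
# The :peers / :status field names are placeholders, not the actual state.
defp handle_discovered_peers(_peers, state) do
  all_down? =
    map_size(state.peers) > 0 and
      Enum.all?(state.peers, fn {_addr, %{status: status}} -> status == :down end)

  if all_down? do
    # All hosts look :down, but the control connection clearly reached a node,
    # so try to start pools again via the existing maybe_start_pools/1. If a
    # node is really down, we fail to connect and it stays :down; if it's
    # actually up, the connection succeeds and we mark it :connected.
    maybe_start_pools(state)
  else
    state
  end
end
```

That way the periodic topology refresh would double as a recovery path for the case where we've lost all `:up`/`:down` information.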
We observed a couple of times that when all of the Cassandra nodes go down for a bit and then come back up, `Xandra` sometimes fails to reconnect. I have reproduced this and observed the following:
After all nodes are shut down, the control connection fails to connect to any of the nodes at first.

Once one node is up again, we can establish a control connection; however, we don't seem to update our state properly and can't recover from that.