harunzengin opened 1 week ago
I dug deeper into this and it seems like there are 2 problems:

1. When all nodes go down, we successfully mark all the hosts as `:down`. We continuously try to establish a new control connection to any of the nodes. After we have established a new control connection, we only send `Xandra.Cluster.Pool` the `:discovered_hosts`, which provides information about the topology of the cluster but no `:up` or `:down` information. This leads to us being connected to one of the nodes with the control connection while all of the nodes stay marked as `:down` in the `Xandra.Cluster.Pool` state. This is not difficult to fix: emitting a `:host_up` event when we establish a new control connection should fix the problem (see the sketch after this list).

2. We rely on Cassandra gossip to inform us about host `:up` or `:down` events. When the Cassandra nodes start up close together in time, it seems that gossip doesn't inform us (or the message somehow gets lost). The periodic topology refresh tells us about the cluster topology periodically, but there's no `:host_up` or `:host_down` information in it, so we keep those hosts as `:down` in `Xandra.Cluster.Pool` forever (unless there are individual `:host_up` or `:host_down` events). Not sure how to fix...
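For the first problem, here's a minimal sketch of what I mean. The module and function names are made up for illustration (they are not the actual Xandra internals); the point is just that the control connection could also report the host it connected to as up, instead of only sending the discovered topology:

```elixir
# Hypothetical sketch, not the actual Xandra code: when the control connection
# manages to (re)connect to a node, it reports that node as up in addition to
# sending the discovered topology to the pool.
defmodule ControlConnectionSketch do
  # `pool` is the Xandra.Cluster.Pool process, `host` is the node we just
  # connected to, `peers` is the topology we discovered from that node.
  def report_connected(pool, host, peers) do
    # Today we only send the topology information...
    send(pool, {:discovered_hosts, peers})

    # ...the idea is to also emit a :host_up for the node behind the control
    # connection, so the pool doesn't keep every host marked :down after a
    # full-cluster outage.
    send(pool, {:host_up, host})
  end
end
```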
I think to solve the second problem, we might try to handle the `:discovered_peers` event differently when all of our nodes are marked as `:down`. In that case, we should try to start new pools via `Xandra.Cluster.Pool.maybe_start_pools/1`. If the nodes are really down, we already have the logic where we fail to establish a connection and mark the node as `:down`; however, if they are actually up, we'll mark them as `:connected`. Rough sketch below. Thoughts @whatyouhide ?
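Sketch only, meant as a clause inside `Xandra.Cluster.Pool`; the `:peers` and `:status` fields are placeholders and probably don't match the real state layout:

```elixir
# Sketch: handle :discovered_peers differently when every known host is :down.
# The :peers / :status field names are placeholders, not the actual state.
defp handle_discovered_peers(_peers, state) do
  all_down? =
    map_size(state.peers) > 0 and
      Enum.all?(state.peers, fn {_addr, %{status: status}} -> status == :down end)

  if all_down? do
    # All hosts look :down, but the control connection clearly reached a node,
    # so try to start pools again via the existing maybe_start_pools/1. If a
    # node is really down, we fail to connect and it stays :down; if it's
    # actually up, the connection succeeds and we mark it :connected.
    maybe_start_pools(state)
  else
    state
  end
end
```

That way the periodic topology refresh would double as a recovery path for the case where we've lost all `:up`/`:down` information.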
We observed a couple of times that when all of the Cassandra nodes go down for a bit and then come back up, `Xandra` sometimes fails to reconnect. I have reproduced this and observed the following:
After all nodes are shut down, the control connection fails to connect to any of the nodes at first.

Once one node is up again, we can establish a control connection; however, we don't seem to update our state properly and can't recover from that.