yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[Jepsen] Recovery after network partitions seems to take a while #821

Closed aphyr closed 5 years ago

aphyr commented 5 years ago

This may not be a real issue in YB; I'm still exploring behavior and collecting more evidence!

On Yugabyte CE, version 1.1.10.0, recovery from network partitions seems to take a little longer than one might expect. In particular, when performing a series of single-row CQL inserts during a mixture of randomized 3/2 split network partitions, followed by the end of all partitions, 30 seconds for the cluster to heal, and a final read of all rows, that read consistently times out.

This cluster is running on 5 nodes, with replication factor 3. n1, n2, and n3 are masters; n1, n2, n3, n4, and n5 are tservers.
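
(For a sense of the workload's shape, here's a minimal sketch written directly against the Java driver; the keyspace and table names are illustrative, and this is not the actual test client -- that's yugabyte.set.CQLSetClient, which shows up in the stack traces later in this thread.)

    (import '(com.datastax.driver.core Cluster))

    ;; Sketch only: one single-row CQL insert per operation, then a final
    ;; read of all rows at the end of the test.
    (defn run-workload [contact-point n]
      (with-open [cluster (.. (Cluster/builder) (addContactPoint contact-point) (build))
                  session (.connect cluster)]
        (.execute session "CREATE KEYSPACE IF NOT EXISTS jepsen WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}")
        (.execute session "CREATE TABLE IF NOT EXISTS jepsen.elements (id int PRIMARY KEY)")
        (doseq [i (range n)]
          (.execute session (str "INSERT INTO jepsen.elements (id) VALUES (" i ")")))
        ;; final read of all rows
        (count (.all (.execute session "SELECT id FROM jepsen.elements")))))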

For instance, in this test, the second partition (the shaded region in this plot) appears to prevent all clients from performing writes, and even with 30 seconds of recovery before the final read, that read times out:

[latency plot: latency-raw 64]

This may be expected behavior, or might be some sort of client issue, so I'm going to keep exploring. I'll update this issue as I find out more. :)

aphyr commented 5 years ago

This is a slightly different workload which performs single-row inserts and all-row reads throughout the test, instead of waiting for recovery before doing a read. When we schedule partitions in long-running patterns, it looks like it takes about 20 seconds to fully recover:

[latency plots: latency-raw 66, latency-raw 68]

However, that same workload with 30s/30s healthy/partitioned cycles results in a cluster which never appears to recover to a healthy state. I'm wondering if more rapid (~30s) changes in network topology can knock YB into a confused state that takes longer to recover from...

[latency plot: latency-raw 65]

aphyr commented 5 years ago

Recovery times are variable, and there does appear to be a relationship between frequency and recovery time--partitions of 55+ seconds seem reliably recoverable, while partitions of ~40 seconds or shorter seem to prevent recovery. Here are 70, 60, 50, 40, 30, 20, and 10-second partitions in sequence:

[latency plot: latency-raw 73]

50 seconds seems like it can go either way:

[latency plot: latency-raw 74]

[latency plot: latency-raw 75]

aphyr commented 5 years ago

Conversely, repeated patterns of 5-second partitions seem quickly recoverable? I'm utterly lost here.

[latency plot: latency-raw 76]

aphyr commented 5 years ago

I'm starting to suspect that there's some sort of state you can get a cluster into that is harder to get out of than normal, which is why we're seeing variable results in these tests. Consider, for instance, these three 1000-second tests with 20-second cycles of partitions followed by 60 second windows for recovery. Most of the time the cluster recovers OK, but sometimes it seems to get wedged for a good long while!

[latency plots: latency-raw 82, latency-raw 81, latency-raw 80]

Here's the full logs for that last test: http://aphyr.com/media/20190131T124802.000-0500.zip

amitanandaiyer commented 5 years ago

Looking into http://aphyr.com/media/20190131T124802.000-0500.zip

  1. I see a lot of "Error opening client" errors:

grepped for "worker 0|nemesis." in jepsen.log

 2019-01-31 13:00:15,566{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-01-31 13:00:15,668{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n5 #{n2 n1 n3}, n4 #{n2 n1 n3}, n2 #{n5 n4}, n1 #{n5 n4}, n3 #{n5 n4}}]
 2019-01-31 13:00:17,100{GMT}  WARN  [jepsen worker 0] jepsen.core: Error opening client
 2019-01-31 13:00:17,101{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :fail :add  446 [:no-client All host(s) tried for query failed (tried: n1/192.168.122.11:9042 (com.datastax.driver.core.exceptions.   OperationTimedOutException: [n1/192.168.122.11:9042] Operation timed out))]
 2019-01-31 13:00:17,111{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :invoke :add  448
 2019-01-31 13:00:17,111{GMT}  INFO  [jepsen worker 0] com.datastax.driver.core.ClockFactory: Using native clock to generate timestamps.
 2019-01-31 13:00:31,435{GMT}  WARN  [jepsen worker 0] jepsen.core: Error opening client
 2019-01-31 13:00:31,435{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :fail :add  448 [:no-client All host(s) tried for query failed (tried: n1/192.168.122.11:9042 (com.datastax.driver.core.exceptions.   OperationTimedOutException: [n1/192.168.122.11:9042] Operation timed out))]
 2019-01-31 13:00:31,446{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :invoke :add  450
 2019-01-31 13:00:31,446{GMT}  INFO  [jepsen worker 0] com.datastax.driver.core.ClockFactory: Using native clock to generate timestamps.
 2019-01-31 13:00:35,668{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop nil
 2019-01-31 13:00:35,875{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop :network-healed
 2019-01-31 13:00:45,769{GMT}  WARN  [jepsen worker 0] jepsen.core: Error opening client
 2019-01-31 13:00:45,769{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :fail :add  450 [:no-client All host(s) tried for query failed (tried: n1/192.168.122.11:9042 (com.datastax.driver.core.exceptions.   OperationTimedOutException: [n1/192.168.122.11:9042] Operation timed out))]
 2019-01-31 13:00:45,786{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :invoke :add  452
 2019-01-31 13:00:45,786{GMT}  INFO  [jepsen worker 0] com.datastax.driver.core.ClockFactory: Using native clock to generate timestamps.
 2019-01-31 13:00:55,875{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-01-31 13:00:55,977{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n2 #{n5 n4 n3}, n1 #{n5 n4 n3}, n5 #{n2 n1}, n4 #{n2 n1}, n3 #{n2 n1}}]
 2019-01-31 13:01:00,116{GMT}  WARN  [jepsen worker 0] jepsen.core: Error opening client
 2019-01-31 13:01:00,116{GMT}  INFO  [jepsen worker 0] jepsen.util: 25 :fail :add  452 [:no-client All host(s) tried for query failed (tried: n1/192.168.122.11:9042 (com.datastax.driver.core.exceptions.

So, it looks like a lot of these requests may not even be hitting the YB servers. The stack trace when we run into such an error looks something like this:

2019-01-31 12:50:00,448{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: n2/192.168.122.12:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [n2/192.   168.122.12:9042] Operation timed out))
   at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:255) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:102) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect(Cluster.java:1623) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1541) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.init(Cluster.java:160) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:331) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:306) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connect(Cluster.java:248) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at clojurewerkz.cassaforte.client$connect_or_close.invokeStatic(client.clj:187) ~[na:na]
   at clojurewerkz.cassaforte.client$connect_or_close.doInvoke(client.clj:181) ~[na:na]
   at clojure.lang.RestFn.invoke(RestFn.java:410) [clojure-1.10.0.jar:na]
   at clojurewerkz.cassaforte.client$connect.invokeStatic(client.clj:203) ~[na:na]
   at clojurewerkz.cassaforte.client$connect.invoke(client.clj:192) ~[na:na]
   at yugabyte.client$connect.invokeStatic(client.clj:26) ~[na:na]
   at yugabyte.client$connect.invoke(client.clj:20) ~[na:na]
   at yugabyte.client$connect.invokeStatic(client.clj:24) ~[na:na]
   at yugabyte.client$connect.invoke(client.clj:20) ~[na:na]
   at yugabyte.set.CQLSetClient.open_BANG_(set.clj:17) ~[na:na]
   at jepsen.core.ClientWorker$fn__5610.invoke(core.clj:317) ~[classes/:na]
   at jepsen.core.ClientWorker.run_worker_BANG_(core.clj:315) [classes/:na]
   at jepsen.core$run_workers_BANG_$run__5575.invoke(core.clj:187) [classes/:na]
   at dom_top.core$real_pmap$build_thread__223$fn__224.invoke(core.clj:166) [classes/:na]
   at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.10.0.jar:na]
   at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.10.0.jar:na]
   at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.0.jar:na]
   at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) [clojure-1.10.0.jar:na]
   at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) [clojure-1.10.0.jar:na]
   at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.10.0.jar:na]
   at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.10.0.jar:na]
   at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.10.0.jar:na]
   at clojure.core$apply.invokeStatic(core.clj:669) [clojure-1.10.0.jar:na]
   at clojure.core$bound_fn_STAR_$fn__5734.doInvoke(core.clj:2003) [clojure-1.10.0.jar:na]

@aphyr Does the partition affect the communication between the Jepsen client and the yb-servers at all? Or is it fair to expect that those links should always be open, and that the only links being broken/mended are the ones between the yb-servers themselves?

  2. This issue aside, we expect that once the network is healed, each RAFT group will perform its own election, elect a leader, and make further progress. Ideally, this should happen within 10 secs or so. However, @bmatican pointed out that the election exponential back-off is currently capped at 20 sec (--leader_failure_exp_backoff_max_delta_ms=20000), so it is possible that some nodes back off all the way to this limit, leaving the corresponding tablet unavailable for up to 20 sec (see the illustrative sketch after this list).

  3. I see that sometimes when there is a partition, the minority node seems to continuously call for elections and increase its term in comparison to the other two nodes. When the network is healed, this results in the leader stepping down and calling for a new election because it has seen a higher term. Ideally, I'd expect this to happen immediately after the network has healed; however, in the logs (from a 4-CPU machine repro) I'm noticing that it sometimes happens much later. This may also be the cause of some of the operation-timeout/write-fail errors seen by the client as long as 20 sec later.
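
(To make the back-off in 2. concrete, here's a small, purely illustrative sketch of an exponential election back-off with a hard cap; the 500 ms base and the doubling rule are assumptions, not the exact formula YB uses.)

    ;; Illustrative only: per-attempt exponential back-off, capped by a flag
    ;; like --leader_failure_exp_backoff_max_delta_ms=20000.
    (defn election-backoff-ms [attempt base-ms max-delta-ms]
      (min max-delta-ms (* base-ms (long (Math/pow 2 attempt)))))

    ;; With an assumed 500 ms base, attempts 0..6 give:
    (map #(election-backoff-ms % 500 20000) (range 7))
    ;; => (500 1000 2000 4000 8000 16000 20000)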

aphyr commented 5 years ago

Yes, I've also been wondering about this! It's odd that we get timeouts during the connection process, right? This isn't something Jepsen does: we only interfere with network connections between database nodes, not from the clients to the nodes. I can think of a few ways this might happen:

  1. The Cassandra Java client somehow squirrels away state about node availability in a global somewhere, and even though we're standing up completely fresh clients, thinks that it's not worth even trying to connect to nodes because it thinks they're likely to be down.
  2. We're opening a TCP connection to the server, but instead of CONNREFUSED, we're hanging somewhere in that initial handshake or protocol negotiation? I think this is maybe our most likely option given the stacktrace from the client, but I haven't read that part of the Cassandra client code yet.
  3. We are able to open a connection, but before the Java client connects, it tries to make some sort of query--for instance, discovering the full set of cluster members, and it's that query that's timing out.
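
One quick way to poke at hypothesis 3 might be to time a completely fresh connect against a single node: if the TCP handshake succeeds but .connect still blocks, the delay is in the driver's initial control-connection queries. A rough diagnostic sketch (not part of the test itself):

    (import '(com.datastax.driver.core Cluster))

    ;; Time how long a brand-new driver instance takes to finish .connect,
    ;; which includes the control connection's system-table queries.
    (defn time-connect-ms [host]
      (let [cluster (.. (Cluster/builder) (addContactPoint host) (build))
            start   (System/nanoTime)]
        (try
          (with-open [session (.connect cluster)]
            (/ (- (System/nanoTime) start) 1e6))
          (finally (.close cluster)))))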

20 seconds would explain some of the long recovery times we've seen (though some are longer!). I'll try tuning that down.

amitanandaiyer commented 5 years ago

Yeah, tuning --leader_failure_exp_backoff_max_delta_ms should help address issue 2) in https://github.com/YugaByte/yugabyte-db/issues/821#issuecomment-462109656.

As for issue 3) in https://github.com/YugaByte/yugabyte-db/issues/821#issuecomment-462109656: Kudu/Raft has an enhancement called pre-elections that prevents the minority partition from bumping up its terms, thereby reducing election churn. I'm working on porting that over to YugaByte in https://github.com/YugaByte/yugabyte-db/issues/604, and I'll expedite my effort on that front; it should also help mitigate these "late elections" to some extent.
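
(For reference, a rough sketch of the pre-vote idea, greatly simplified relative to the Kudu implementation: a would-be candidate only bumps its term and starts a real election if a majority of peers indicate they would grant the vote, so an isolated minority node never inflates its term.)

    ;; Simplified pre-vote check. peers is a seq of maps like
    ;; {:last-log-index 42, :heard-from-leader-recently? true}.
    (defn would-grant? [candidate peer]
      (and (not (:heard-from-leader-recently? peer))
           (>= (:last-log-index candidate) (:last-log-index peer))))

    (defn start-real-election? [candidate peers]
      (let [grants (count (filter #(would-grant? candidate %) peers))]
        ;; the candidate counts itself; a strict majority of the full cluster is needed
        (> (inc grants) (quot (inc (count peers)) 2))))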

aphyr commented 5 years ago

Pre-elections miiiight help, but I feel like that shouldn't really cause major issues... Once the partition resolves, any request (either a vote req from the minority nodes to the majority ones, or an append-entries req from the majority leader to a minority node) will force the majority leader to step down, and snap its term forward to the higher minority term; it should immediately be able to go through an election and one of the majority nodes should win, since they likely have committed log entries above the minority nodes. I'm not sure why this scenario would result in multi-second election delays--but I know there's a lot more complexity involved here than basic Raft.

We also have to worry about membership changes, right? So maybe this is something that gets worse when joint consensus is involved...

amitanandaiyer commented 5 years ago

Yeah, I agree: once the majority leader steps down, it should "immediately" be re-elected.

Ideally, that should probably be sub-second. But there could be some inefficiency in the process that's causing this to take 2 election rounds (about 3-6 sec).

That by itself is probably still OK. But if the client timeout is < 6 sec, it could cause an operation timeout and prompt the Jepsen client to create a new client / re-connect and run into issue 1).

I'll try to dig in a little more to see if I can identify the cause of, and fix, issue 1). That's probably the main contributor to the Jepsen client not making progress.

amitanandaiyer commented 5 years ago

For the logs that I've looked at, I haven't seen any membership changes. If they were to happen, they might add some more corner cases to the equation, but from a perf standpoint, config changes shouldn't have much effect here and shouldn't cause additional timeouts for the Jepsen client.

amitanandaiyer commented 5 years ago

I see this in one of the runs. I'll look into what's happening during the node-list refresh.

 2019-02-09 06:39:53,486{GMT}  ERROR [cluster9-worker-1] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [n4/192.168.122.14:9042] Operation timed out
   at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:476) ~[guava-19.0.jar:na]
   at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:455) ~[guava-19.0.jar:na]
   at com.datastax.driver.core.ControlConnection.refreshNodeListAndTokenMap(ControlConnection.java:582) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.ControlConnection.refreshNodeListAndTokenMap(ControlConnection.java:376) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster$Manager$NodeListRefreshRequestDeliveryCallback$1.runMayThrow(Cluster.java:2894) [cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32) [cassandra-driver-core-3.2.0-yb-19.jar:na]
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_201]
   at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108) [guava-19.0.jar:na]
   at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) [guava-19.0.jar:na]
   at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) [guava-19.0.jar:na]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_201]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_201]
   at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-common-4.0.44.Final.jar:4.0.44.Final]
   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
 Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [n4/192.168.122.14:9042] Operation timed out
   at com.datastax.driver.core.DefaultResultSetFuture.onTimeout(DefaultResultSetFuture.java:228) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1406) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:625) ~[netty-common-4.0.44.Final.jar:4.0.44.Final]
   at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:700) ~[netty-common-4.0.44.Final.jar:4.0.44.Final]
   at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:428) ~[netty-common-4.0.44.Final.jar:4.0.44.Final]
   ... 2 common frames omitted
 2019-02-09 06:39:54,801{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: n2/192.168.122.12:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [n2/192.   168.122.12:9042] Operation timed out))
   at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:255) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:102) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect(Cluster.java:1623) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1541) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.init(Cluster.java:160) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:331) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:306) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at com.datastax.driver.core.Cluster.connect(Cluster.java:248) ~[cassandra-driver-core-3.2.0-yb-19.jar:na]
   at yugabyte.client$connect.invokeStatic(client.clj:94) ~[classes/:na]
   at yugabyte.client$connect.invoke(client.clj:87) ~[classes/:na]
   at yugabyte.client$connect.invokeStatic(client.clj:91) ~[classes/:na]
   at yugabyte.client$connect.invoke(client.clj:87) ~[classes/:na]
   at yugabyte.set.CQLSetClient.open_BANG_(set.clj:16) ~[classes/:na]
   at jepsen.core.ClientWorker$fn__5622.invoke(core.clj:317) ~[jepsen-0.1.12-SNAPSHOT.jar:na]
   at jepsen.core.ClientWorker.run_worker_BANG_(core.clj:315) [jepsen-0.1.12-SNAPSHOT.jar:na]
   at jepsen.core$run_workers_BANG_$run__5587.invoke(core.clj:187) [jepsen-0.1.12-SNAPSHOT.jar:na]
   at dom_top.core$real_pmap$build_thread__223$fn__224.invoke(core.clj:166) [jepsen-0.1.12-SNAPSHOT.jar:na]
   at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.10.0.jar:na]
   at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.10.0.jar:na]
   at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.0.jar:na]
   at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) [clojure-1.10.0.jar:na]
   at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) [clojure-1.10.0.jar:na]
   at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.10.0.jar:na]
   at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.10.0.jar:na]

 2019-02-09 06:39:05,257{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-02-09 06:39:05,702{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n4 #{n2 n5 n1}, n3 #{n2 n5 n1}, n2 #{n4 n3}, n5 #{n4 n3}, n1 #{n4 n3}}]
 2019-02-09 06:39:35,702{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop nil
 2019-02-09 06:39:35,911{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop :network-healed
 2019-02-09 06:39:53,486{GMT}  ERROR [cluster9-worker-1] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 2019-02-09 06:39:54,801{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:39:55,581{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:39:58,680{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:01,739{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:03,914{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:35,912{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-02-09 06:40:36,420{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n2 #{n1 n4 n3}, n5 #{n1 n4 n3}, n1 #{n2 n5}, n4 #{n2 n5}, n3 #{n2 n5}}]
 2019-02-09 06:40:45,215{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:49,614{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:53,587{GMT}  ERROR [cluster10-worker-2] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 2019-02-09 06:40:53,980{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:55,686{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:56,691{GMT}  ERROR [cluster8-worker-3] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 2019-02-09 06:40:58,423{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:58,785{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:58,790{GMT}  ERROR [cluster8-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:01,885{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:01,890{GMT}  ERROR [cluster8-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:02,931{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:04,985{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:06,420{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop nil
 2019-02-09 06:41:06,636{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop :network-healed
 2019-02-09 06:41:07,369{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:08,085{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:11,186{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:11,751{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:14,286{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:16,077{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:17,386{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:20,485{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:20,535{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:23,586{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:24,925{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:26,685{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:29,374{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:37,539{GMT}  WARN  [jepsen worker 0] jepsen.core: Error opening client
 2019-02-09 06:41:38,264{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:42:06,636{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-02-09 06:42:06,844{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n2 #{n5 n1 n3}, n4 #{n5 n1 n3}, n5 #{n2 n4}, n1 #{n2 n4}, n3 #{n2 n4}}]
 2019-02-09 06:42:15,426{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client

amitanandaiyer commented 5 years ago

Merging relevant portions of jepsen.log with the time taken for querying system tables, from:

grep -B 4 "FROM system.peers" run20/n*/tserver/logs/*INFO* | grep took | sort -k 2

2019-02-09 06:37:54,411{GMT}  WARN  [cluster3-worker-0] com.datastax.driver.core.ControlConnection: Cannot find Cassandra version for host n1/192.168.122.11:9042 to parse the schema, using 2.1.0 based    on protocol version in use. If parsing the schema fails, this could be the cause
 2019-02-09 06:37:58,570{GMT}  WARN  [cluster3-worker-0] com.datastax.driver.core.ControlConnection: Cannot find Cassandra version for host n1/192.168.122.11:9042 to parse the schema, using 2.1.0 based    on protocol version in use. If parsing the schema fails, this could be the cause
 2019-02-09 06:37:58,573{GMT}  INFO  [cluster3-worker-0] com.datastax.driver.core.SchemaParser: Asked to rebuild TABLE jepsen_setup.waiting but I don't know keyspace jepsen_setup
 2019-02-09 06:38:00,873{GMT}  INFO  [jepsen nemesis] jepsen.core: Setting up nemesis
 2019-02-09 06:38:05,254{GMT}  INFO  [jepsen nemesis] jepsen.core: Running nemesis
 2019-02-09 06:39:05,257{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-02-09 06:39:05,702{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n4 #{n2 n5 n1}, n3 #{n2 n5 n1}, n2 #{n4 n3}, n5 #{n4 n3}, n1 #{n4 n3}}]
 2019-02-09 06:39:35,702{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop nil
 2019-02-09 06:39:35,911{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop :network-healed
 run20/n1/tserver/logs/yb-tserver.n1.root.log.INFO.20190209-063753.15610-W0209 06:39:52.697824 15765 cql_rpc.cc:271] CQL Call from 192.168.122.1:47450 took 1939ms. Details:
 run20/n5/tserver/logs/yb-tserver.n5.root.log.INFO.20190209-063750.13836-W0209 06:39:52.716047 14028 cql_rpc.cc:271] CQL Call from 192.168.122.1:49370 took 1161ms. Details:
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:39:52.744238 16092 cql_rpc.cc:271] CQL Call from 192.168.122.1:52612 took 2248ms. Details:
 2019-02-09 06:39:53,486{GMT}  ERROR [cluster9-worker-1] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 run20/n4/tserver/logs/yb-tserver.n4.root.log.INFO.20190209-063750.13860-W0209 06:39:53.426870 13965 cql_rpc.cc:271] CQL Call from 192.168.122.1:36258 took 2013ms. Details:
 2019-02-09 06:39:54,801{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:39:55,581{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:39:58,680{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:01,739{GMT}  ERROR [cluster9-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:40:01.729377 16248 cql_rpc.cc:271] CQL Call from 192.168.122.1:52629 took 2121ms. Details:
 2019-02-09 06:40:03,914{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:35,912{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  nil
 2019-02-09 06:40:36,420{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :start  [:isolated {n2 #{n1 n4 n3}, n5 #{n1 n4 n3}, n1 #{n2 n5}, n4 #{n2 n5}, n3 #{n2 n5}}]
 2019-02-09 06:40:45,215{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:49,614{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:40:50.922652 16093 cql_rpc.cc:271] CQL Call from 192.168.122.1:52654 took 10012ms. Details:
 2019-02-09 06:40:53,587{GMT}  ERROR [cluster10-worker-2] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 2019-02-09 06:40:53,980{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:55,686{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:40:55.315099 16092 cql_rpc.cc:271] CQL Call from 192.168.122.1:52657 took 10007ms. Details:
 2019-02-09 06:40:56,691{GMT}  ERROR [cluster8-worker-3] com.datastax.driver.core.ControlConnection: [Control connection] Unexpected error while refreshing node list and token map
 run20/n3/tserver/logs/yb-tserver.n3.root.log.INFO.20190209-063753.15936-W0209 06:40:57.100723 16012 cql_rpc.cc:271] CQL Call from 192.168.122.1:51509 took 2508ms. Details:
 2019-02-09 06:40:58,423{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:40:58,785{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:40:58,790{GMT}  ERROR [cluster8-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:40:59.682726 16087 cql_rpc.cc:271] CQL Call from 192.168.122.1:52660 took 10007ms. Details:
 2019-02-09 06:41:01,885{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 2019-02-09 06:41:01,890{GMT}  ERROR [cluster8-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n5/tserver/logs/yb-tserver.n5.root.log.INFO.20190209-063750.13836-W0209 06:41:01.562726 14030 cql_rpc.cc:271] CQL Call from 192.168.122.1:49370 took 10007ms. Details:
 2019-02-09 06:41:02,931{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:04,985{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:41:04.123623 16095 cql_rpc.cc:271] CQL Call from 192.168.122.1:52664 took 10004ms. Details:
 2019-02-09 06:41:06,420{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop nil
 2019-02-09 06:41:06,636{GMT}  INFO  [jepsen nemesis] jepsen.util: :nemesis  :info :stop :network-healed
 2019-02-09 06:41:07,369{GMT}  WARN  [jepsen worker 1] jepsen.core: Error opening client
 2019-02-09 06:41:08,085{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
 run20/n2/tserver/logs/yb-tserver.n2.root.log.INFO.20190209-063753.16013-W0209 06:41:08.634827 16248 cql_rpc.cc:271] CQL Call from 192.168.122.1:52670 took 10009ms. Details:
 2019-02-09 06:41:11,186{GMT}  ERROR [cluster10-reconnection-0] com.datastax.driver.core.ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds

It looks like most of the queries that result in an "Error opening client" are preceded by a query to a system table that takes > 1 sec.

We seem to be setting the connection timeout in yugabyte/src/client.clj to 1 sec:

     (withSocketOptions (.. (SocketOptions.)
                          (setConnectTimeoutMillis 1000)
                          (setReadTimeoutMillis 2000)))

We should bump it up to 1 + 6 sec to allow for the time to make one CQL call.
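
A sketch of what the adjusted builder might look like, with the connect timeout bumped above the read timeout as suggested (the 7000 ms value is just 1 + 6 sec, not something that's been validated):

    (import '(com.datastax.driver.core Cluster SocketOptions))

    ;; Sketch of the suggested change: raise the connect timeout from 1000 ms
    ;; to 7000 ms, leaving the read timeout at 2000 ms.
    (defn build-cluster [contact-point]
      (.. (Cluster/builder)
          (addContactPoint contact-point)
          (withSocketOptions (.. (SocketOptions.)
                                 (setConnectTimeoutMillis 7000) ; was 1000
                                 (setReadTimeoutMillis 2000)))
          (build)))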

amitanandaiyer commented 5 years ago

@aphyr I believe increasing setConnectTimeoutMillis to something larger than the timeout we allow for the CQL queries should help us get past issue 1) (running into "Error opening client").

aphyr commented 5 years ago

We lowered that timeout so that we don't get stuck waiting indefinitely for YB to start doing operations again--previous tests weren't making many requests during partitions because of high timeouts. We can raise it here, but... does it feel like maybe it shouldn't take six seconds to do a query that every Cassandra client has to do before anything else? That seems weirdly high...

amitanandaiyer commented 5 years ago

I'm OK with lower timeouts if we want that. My concern is that if we expect to do a CQL read as part of opening the connection, we should probably keep the connectTimeout slightly higher than the timeout for the read.

But yeah, it shouldn't take six seconds -- unless we are in a partition...

Here's a breakdown from one of the times when the network was back in a good state:

W0209 06:41:08.634827 16248 cql_rpc.cc:271] CQL Call from 192.168.122.1:52670 took 10009ms. Details:
 W0209 06:41:08.634898 16248 cql_rpc.cc:274] cql_details {
   type: "QUERY"
   call_details {
     sql_string: "SELECT * FROM system.peers"
   }
 }
 Trace:
 W0209 06:41:08.634912 16248 cql_rpc.cc:275] 0209 06:40:58.625681 (+     0us) inbound_call.cc:86] Created InboundCall
 0209 06:40:58.625683 (+     2us) service_pool.cc:141] Inserting onto call queue
 0209 06:40:58.625716 (+    33us) service_pool.cc:212] Handling call
 0209 06:40:58.625716 (+     0us) cql_service.cc:122] Handling the CQL call
 0209 06:40:58.626088 (+   372us) executor.cc:1847] Apply
 0209 06:40:58.626115 (+    27us) executor.cc:1271] Flush Async
 0209 06:41:08.634803 (+10008688us) inbound_call.cc:172] Queueing success response
 Related trace:
 0209 06:40:58.626130 (+     0us) async_rpc.cc:557] ReadRpc initiated to 0181dc987cc946d5a4d0a053dd091b9a
 0209 06:40:58.626136 (+     6us) async_rpc.cc:132] SendRpc() called.
 0209 06:40:58.626140 (+     4us) async_rpc.cc:627] SendRpcToTserver
 0209 06:40:58.626199 (+    59us) async_rpc.cc:632] RpcDispatched Asynchronously
 0209 06:41:01.127509 (+2501310us) tablet_rpc.cc:218] Done(OK)
 0209 06:41:01.129791 (+  2282us) async_rpc.cc:132] SendRpc() called.
 0209 06:41:01.129808 (+    17us) async_rpc.cc:627] SendRpcToTserver
 0209 06:41:01.129880 (+    72us) async_rpc.cc:632] RpcDispatched Asynchronously
 0209 06:41:03.630587 (+2500707us) tablet_rpc.cc:218] Done(OK)
 0209 06:41:03.636793 (+  6206us) async_rpc.cc:132] SendRpc() called.
 0209 06:41:08.628179 (+4991386us) tablet_rpc.cc:348] LookupTabletCb(Timed out: LookupByIdRpc(0181dc987cc946d5a4d0a053dd091b9a, 1) failed: timed out after deadline expired: GetTabletLocations RPC to 192.168.122.11:7100 timed out after 4.989s)
 0209 06:41:08.634483 (+  6304us) tablet_rpc.cc:218] Done(Timed out: Read(tablet: 0181dc987cc946d5a4d0a053dd091b9a, num_ops: 1, num_attempts: 4, txn: 00000000-0000-0000-0000-000000000000) passed its deadline 37874.891s (now: 37874.899s): Timed out (yb/rpc/outbound_call.cc:439): LookupByIdRpc(0181dc987cc946d5a4d0a053dd091b9a, 1) failed: timed out after deadline expired: GetTabletLocations RPC to 192.168.122.11:7100 timed out after 4.989s)
 0209 06:41:08.634632 (+   149us) async_rpc.cc:762] ProcessResponseFromTserver(Timed out: Failed Read(tablet: 0181dc987cc946d5a4d0a053dd091b9a, num_ops: 1, num_attempts: 4, txn: 00000000-0000-0000-0000-000000000000) to tablet 0181dc987cc946d5a4d0a053dd091b9a (no tablet server available) after 4 attempt(s): Read(tablet: 0181dc987cc946d5a4d0a053dd091b9a, num_ops: 1, num_attempts: 4, txn: 00000000-0000-0000-0000-000000000000) passed its deadline 37874.891s (now: 37874.899s): Timed out (yb/rpc/outbound_call.cc:439): LookupByIdRpc(0181dc987cc946d5a4d0a053dd091b9a, 1) failed: timed out after deadline expired: GetTabletLocations RPC to 192.168.122.11:7100 timed out after 4.989s)
 Related trace:
 0209 06:40:58.626144 (+     0us) outbound_call.cc:184] Outbound Call initiated.
 0209 06:40:58.626198 (+    54us) reactor.cc:690] Scheduled.
 0209 06:40:58.626371 (+   173us) outbound_call.cc:390] Queued.
 0209 06:40:58.626383 (+    12us) outbound_call.cc:401] Call Sent.
 0209 06:41:01.127473 (+2501090us) outbound_call.cc:433] Call TimedOut.
 Related trace:
 0209 06:41:01.129813 (+     0us) outbound_call.cc:184] Outbound Call initiated.
 0209 06:41:01.129879 (+    66us) reactor.cc:690] Scheduled.
 0209 06:41:01.129941 (+    62us) outbound_call.cc:390] Queued.
 0209 06:41:01.129954 (+    13us) outbound_call.cc:401] Call Sent.
 0209 06:41:03.630488 (+2500534us) outbound_call.cc:433] Call TimedOut.

aphyr commented 5 years ago

I think connectTimeout here refers to the TCP connect timeout, so raising it won't make a difference if it's the query that's slow. It... is weird that it makes a query as a part of the connection process, but that just seems to be how the Cassandra client is designed. Wonder if we can turn that off somehow... I should go digging in that code.

amitanandaiyer commented 5 years ago

I'm investigating why queries to the master might take long.

It looks like the nemesis only healed the network at 06:41:06, so the trace I pasted above is probably a bad example: 8 sec of it was during the partition, so it's probably OK that it took long.

I'll try to find another case where it takes long during good connectivity.

amitanandaiyer commented 5 years ago

There seems to be a refreshNodeListIntervalMillis setting, which defaults to 1 sec in the Cassandra client.

We could probably set it to something much larger if we want to effectively disable node discovery. From what I understand, we want queries from the Jepsen client to go only to the node that we connect to, and we do not want the Jepsen client to forward the request to a different node (which may be "local" for that key/tablet).
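
If we wanted to try that, a sketch of how it could be set through the driver's QueryOptions (assuming the 3.x driver API; the 60-second value is arbitrary):

    (import '(com.datastax.driver.core Cluster QueryOptions))

    ;; Sketch: debounce node-list/token-map refreshes so they fire at most
    ;; once a minute instead of the ~1 s default.
    (defn build-cluster-with-slow-refresh [contact-point]
      (.. (Cluster/builder)
          (addContactPoint contact-point)
          (withQueryOptions (.. (QueryOptions.)
                                (setRefreshNodeListIntervalMillis 60000)))
          (build)))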

amitanandaiyer commented 5 years ago

It looks like even if querying the system table takes > 1 sec, it doesn't cause an issue unless it takes > 2 sec.

I'm thinking that it is probably setReadTimeoutMillis being 2000 that accounts for the connection being reset, more so than the connectTimeout.

I had thought that the timeout used was 10 or 12 sec.

amitanandaiyer commented 5 years ago

While these account for temporary/unnecessary glitches, they would not account for your third figure, where things are consistently down. If the masters are in good condition, querying the system tablets should eventually take on the order of milliseconds, and things should get back to normal.

In the logs you uploaded, it looks like the proxies are busy in meta_cache.cc, repeatedly querying the masters without any success.

W0131 17:50:57.875892  8578 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.876253  8580 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.876624  8579 meta_cache.cc:562] Leader Master has changed, re-trying...
 I0131 17:50:57.877018  8576 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.877044  8574 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.877104  8575 meta_cache.cc:548] Got resp error 7, code=OK
 W0131 17:50:57.877344  8576 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.877725  8574 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.878124  8575 meta_cache.cc:562] Leader Master has changed, re-trying...
 I0131 17:50:57.878509  8578 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.878546  8579 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.878638  8580 meta_cache.cc:548] Got resp error 7, code=OK
 W0131 17:50:57.878947  8578 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.879328  8579 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.879743  8580 meta_cache.cc:562] Leader Master has changed, re-trying...
 I0131 17:50:57.880219  8574 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.880244  8575 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.880264  8576 meta_cache.cc:548] Got resp error 7, code=OK
 W0131 17:50:57.880551  8574 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.880844  8575 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.881136  8576 meta_cache.cc:562] Leader Master has changed, re-trying...
 I0131 17:50:57.881619  8578 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.881652  8580 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.881695  8579 meta_cache.cc:548] Got resp error 7, code=OK
 W0131 17:50:57.882009  8578 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.882387  8580 meta_cache.cc:562] Leader Master has changed, re-trying...
 W0131 17:50:57.882691  8579 meta_cache.cc:562] Leader Master has changed, re-trying...
 I0131 17:50:57.883421  8574 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.883433  8576 meta_cache.cc:548] Got resp error 7, code=OK
 I0131 17:50:57.883488  8575 meta_cache.cc:548] Got resp error 7, code=OK
 W0131 17:50:57.883810  8574 meta_cache.cc:562] Leader Master has changed, re-trying...

Typically, the master leader should respond with a successful response. But it looks like all 3 masters respond with NOT_THE_LEADER even though the leader election was successful and one of them is/was currently the leader.

The status NOT_THE_LEADER also seems to be used in the response if catalog_status_ is not OK, which seems to be the case on at least 2 of the master nodes.

grep catalog_manager.cc:678 n*/master/logs/*INFO*
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:48:11.617588  8532 catalog_manager.cc:678] Catalog manager background task thread going to sleep: Service unavailable (yb/master/catalog_manager.cc:6789): Catalog manager is not initialized. State: 1
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:48:11.604938 37157 catalog_manager.cc:678] Catalog manager background task thread going to sleep: Service unavailable (yb/master/catalog_manager.cc:6789): Catalog manager is not initialized. State: 1

My guess is that the issue you ran into in the 3rd figure is essentially caused by the catalog_manager: this makes all the system.xyz table operations hang or fail.

The Cassandra client might interpret that failure in a way that keeps the Jepsen clients from even making queries to other (normal) tables.

amitanandaiyer commented 5 years ago

@bmatican noticed that the service queues at the masters in your 3rd run have been overflowing, which would explain why the tservers did not hear back from the masters, causing the system to not respond.

21:54 $ grep "dropped due to backpressure" n*/master/logs/*INFO* | sort -k 2
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:50:38.886206 37097 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:56908 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:51:58.985641 37104 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:35720 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:51:59.006337 48634 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:56364 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:51:59.006337 48642 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54536 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:52:39.461382 37102 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:35670 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:52:41.993408 48635 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54574 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021397  8476 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:38526 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021397  8485 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:32896 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021402  8472 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:38430 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021404  8483 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:38466 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021405  8475 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:32956 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:42.021410  8487 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:33012 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:52:43.786003 37109 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:50372 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:52:46.859905 37102 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:50254 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:52:53.012830  8484 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:32948 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:53:19.625246  8475 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:33076 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:53:19.625246  8478 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:34004 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:53:19.625247  8479 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:46558 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:53:19.625248  8481 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:32876 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:53:25.777230  8484 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:32948 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:54:40.009388 37104 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:57022 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:56:00.649279 37106 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:57056 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:56:00.649523 48639 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54656 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:57:20.777384  8478 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:34144 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:57:20.777384  8482 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:46718 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:58:01.224568  8473 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:47734 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:58:01.481781 48632 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54486 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:58:01.486490 37102 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:57830 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:58:05.072430 48631 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54450 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:58:05.079987 37099 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:57858 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:58:35.785995  8479 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:46558 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 17:58:41.419513 48636 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:46956 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:58:41.422143 37109 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:35636 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 17:58:41.422147 37105 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:56928 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 17:58:41.673933  8477 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:46616 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 18:00:01.801544 37106 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:57056 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 18:01:22.441833 37107 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:50226 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 18:01:22.441833 37112 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:51392 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 18:02:36.938232 48640 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.11:46952 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 18:02:42.825561 37112 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.12:51392 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 18:02:42.825572  8484 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:54590 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 18:02:42.826361 48639 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:34684 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 18:02:46.667747 48644 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.13:34760 dropped due to backpressure. The service queue is full, it has 1000 items.
n2/master/logs/yb-master.n2.root.log.INFO.20190131-174811.48621:W0131 18:05:23.593616 48642 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:54536 dropped due to backpressure. The service queue is full, it has 1000 items.
n3/master/logs/yb-master.n3.root.log.INFO.20190131-174811.37087:W0131 18:05:23.593915 37099 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.15:57858 dropped due to backpressure. The service queue is full, it has 1000 items.
n1/master/logs/yb-master.n1.root.log.INFO.20190131-174811.8462:W0131 18:05:23.850908  8487 service_pool.cc:169] GetMasterRegistration request on yb.master.MasterService from 192.168.122.14:54706 dropped due to backpressure. The service queue is full, it has 1000 items.

One simple fix would be to set rpc_queue_limit / master_tserver_svc_queue_length to something larger. It would also be good to see what is filling up the queue, and to look into rate-limiting it so the queue doesn't fill up in the first place.
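For illustration only, raising those limits could look something like the flags below; the flag names are taken from the sentence above, the values are arbitrary placeholders, and both should be checked against yb-master --help for the build in use.

--rpc_queue_limit 5000
--master_tserver_svc_queue_length 5000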

aphyr commented 5 years ago

Yep! connectTimeout controls the time it takes to open the TCP socket, but the Cassandra client connection process does more than that; it also does a synchronous read of the system table to find out what peers are available, and that query is subject to the readTimeout, which is configured as 2 seconds in our tests, not 12 seconds. We lowered the read timeout because 12s timeouts resulted in us making relatively few requests during partitions, and we'd like to move on and make new requests quickly, in the hopes that one of those requests might succeed.
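To make the two timeouts concrete, here is a minimal sketch assuming the DataStax Java driver 3.x API; the contact point and the connect-timeout value are placeholders, and only the 2-second read timeout reflects the test configuration described above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class ClientTimeouts {
    public static void main(String[] args) {
        SocketOptions socketOptions = new SocketOptions()
            .setConnectTimeoutMillis(5000)  // placeholder; bounds only the TCP connect
            .setReadTimeoutMillis(2000);    // 2s as in the tests; also bounds the synchronous
                                            // peer-discovery query made during connection setup
        Cluster cluster = Cluster.builder()
            .addContactPoint("n1")          // placeholder contact point
            .withSocketOptions(socketOptions)
            .build();
        cluster.connect();                  // performs the system-table read described above
        cluster.close();
    }
}

With a 2-second read timeout, that initial query fails fast when the contact node is partitioned, so the client gives up quickly and the test can retry with a fresh client.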

When a client fails to open a connection like this, that's not the end of the world; we immediately close the client and try again, and as you observe, YugaByte DB does recover after some time. Since Cassandra clients make that query for peers as a part of their default connection process, I think it'd be worth making that query read from a local cache--I assume right now it's trying to talk over the network to other nodes, which means new clients are going to have trouble talking to YB during partitions. I can open a new ticket for that, if you think it'd be helpful!

Meta-cache and service queue issues... yeah! Those seem like useful things to address!

aphyr commented 5 years ago

I've done a little more experimenting on version 1.1.13.0-b2, and it looks like not only can YB get stuck for long durations given ongoing network partitions, but it can also get stuck for 5+ minutes even in the absence of failures. However, additional network partitions can kick the cluster into a working state again!

http://jepsen.io.s3.amazonaws.com/analyses/yugabyte-db-1.1.9/20190221T100947-stuck.zip

latency-raw 9

amitanandaiyer commented 5 years ago

This could again be caused by a master leader getting stuck upon re-election: if one master gets stuck, we won't make progress until another partition causes the leadership to move.

So here, I think one master got stuck with this issue when it got re-elected after the small partition, and new clients could not connect for 5+ minutes. The other two masters possibly did not run into this issue, so we are able to make progress when the leadership flips over to one of them.

I have a fix for this issue; it should be addressed once you have a build that includes 1ae0df3.

amitanandaiyer commented 5 years ago

Verified that this was indeed the case: n1 got stuck while opening the catalog table.

n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.917145 18034 catalog_manager.cc:863] Loading table and tablet metadata into memory for term 1
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.917771 18034 catalog_manager.cc:964] RunLoaders: Loading tables into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.918423 18034 catalog_manager.cc:969] RunLoaders: Loading tablets into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.918795 18034 catalog_manager.cc:974] RunLoaders: Loading namespaces into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.919148 18034 catalog_manager.cc:980] RunLoaders: Loading user-defined types into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.919426 18034 catalog_manager.cc:986] RunLoaders: Loading cluster configuration into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.919623 18034 catalog_manager.cc:991] RunLoaders: Loading roles into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.919963 18034 catalog_manager.cc:997] RunLoaders: Loading Redis config into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:09:59.920140 18034 catalog_manager.cc:1003] RunLoaders: Loading sys config into memory.
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:10:59.641687 18804 raft_consensus.cc:787] T 00000000000000000000000000000000 P f229091c6caf42e98871ef79b4a70d1b [term 8 LEADER]: Becoming Leader. State: Replica: f229091c6caf42e98871ef79b4a70d1b, State: 1, Role: LEADER
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:10:59.649502 18805 catalog_manager.cc:863] Loading table and tablet metadata into memory for term 8
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:11:57.591372 19021 raft_consensus.cc:787] T 00000000000000000000000000000000 P f229091c6caf42e98871ef79b4a70d1b [term 15 LEADER]: Becoming Leader. State: Replica: f229091c6caf42e98871ef79b4a70d1b, State: 1, Role: LEADER
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:12:08.373796 19305 raft_consensus.cc:787] T 00000000000000000000000000000000 P f229091c6caf42e98871ef79b4a70d1b [term 17 LEADER]: Becoming Leader. State: Replica: f229091c6caf42e98871ef79b4a70d1b, State: 1, Role: LEADER
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:19:21.266049 21095 raft_consensus.cc:787] T 00000000000000000000000000000000 P f229091c6caf42e98871ef79b4a70d1b [term 31 LEADER]: Becoming Leader. State: Replica: f229091c6caf42e98871ef79b4a70d1b, State: 1, Role: LEADER
n1/master/logs/yb-master.n1.root.log.INFO.20190221-150956.17908:I0221 15:19:39.858561 21171 raft_consensus.cc:787] T 00000000000000000000000000000000 P f229091c6caf42e98871ef79b4a70d1b [term 33 LEADER]: Becoming Leader. State: Replica: f229091c6caf42e98871ef79b4a70d1b, State: 1, Role: LEADER

amitanandaiyer commented 5 years ago

Actually, both n1 and n2 ran into this issue, but n3 did not.

So we were able to make progress when n3 was re-elected leader.

kmuthukk commented 5 years ago

In our tests, the pre-elections improvement in https://github.com/YugaByte/yugabyte-db/commit/2b56a9607a9b6f764a7b7ef9a4a9b2beadd7abd2, along with the fix for #821, helps address this issue.

@aphyr - will leave the issue open till you have a chance to test with the new build (once it is ready).

ttyusupov commented 5 years ago

These are the options I've used to make recovery faster:

--client_read_write_timeout_ms 2000
--leader_failure_max_missed_heartbeat_periods 2
--leader_failure_exp_backoff_max_delta_ms 5000
--rpc_default_keepalive_time_ms 5000
--rpc_connection_timeout_ms 1500

--rpc_default_keepalive_time_ms 5000 causes the tserver to drop connections that have been idle for more than 5 seconds and to use newly created connections after the network is healed. Before we used this option, requests sent to the socket after the network was healed only started to appear at the destination node's socket after about 10-20 seconds.
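For completeness, a hypothetical yb-tserver invocation carrying these flags could look like the line below; the data directory and master addresses are placeholders, and only the tuning flags themselves come from the list above.

./bin/yb-tserver --fs_data_dirs=/mnt/d0 --tserver_master_addrs=n1:7100,n2:7100,n3:7100 --client_read_write_timeout_ms=2000 --leader_failure_max_missed_heartbeat_periods=2 --leader_failure_exp_backoff_max_delta_ms=5000 --rpc_default_keepalive_time_ms=5000 --rpc_connection_timeout_ms=1500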

aphyr commented 5 years ago

In both 1.1.13.0-b2 and 1.1.15.0-b16, it looks like these tuning flags make recovery less reliable, and might introduce spurious failures, at least in my LXC cluster. This is the behavior with stock flags:

latency-raw 14

And with the suggested flags:

latency-raw 13

You can reproduce these by checking out Jepsen 9ea7c3e53b06947d743d4f2bed82dceb0ac088b2, and running:

lein run test -o debian --version 1.1.15.0-b16 --url https://s3-us-west-2.amazonaws.com/downloads.yugabyte.com/new/yugabyte-ee-1.1.15.0-b16-centos-x86_64.tar.gz --concurrency 4n --time-limit 500 --test-count 1 -w single-key-acid --nemesis-interval 60 --nemesis-schedule fixed --nemesis partition-half

lein run test -o debian --version 1.1.15.0-b16 --url https://s3-us-west-2.amazonaws.com/downloads.yugabyte.com/new/yugabyte-ee-1.1.15.0-b16-centos-x86_64.tar.gz --concurrency 4n --time-limit 500 --test-count 1 -w single-key-acid --nemesis-interval 60 --nemesis-schedule fixed --nemesis partition-half --experimental-tuning-flags

aphyr commented 5 years ago

OK, I've done a bunch more testing; recovery times on 1.1.15.0-b16 now look to be around 25 seconds, and clusters are generally reliable about coming back after that time. Here's a typical example from Jepsen 337d5fde62839c631b896c862bb676140ea10d23, running

lein run test -o debian --version 1.1.15.0-b16 --url https://s3-us-west-2.amazonaws.com/downloads.yugabyte.com/new/yugabyte-ee-1.1.15.0-b16-centos-x86_64.tar.gz --concurrency 4n --test-count 5 -w set --nemesis-interval 30 --nemesis partition-half --nemesis-schedule fixed --time-limit 60; and-bell 

latency-raw 18

kmuthukk commented 5 years ago

Thanks @aphyr for confirming.