yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.96k stars 1.07k forks source link

[Jepsen] Acknowledged inserts can be present in reads for tens of seconds, then disappear #824

Closed aphyr closed 5 years ago

aphyr commented 5 years ago

In Jepsen set tests, on a five-node debian cluster running 1.1.10.0, we can observe the apparent loss of acknowledged writes after randomized 3/2 network partitions. Writes succeed, are visible to every subsequent read operation, then, after some critical juncture, are missing from every subsequent read, despite that read observing all other records written before and after the missing element. The insertion of these elements, and the time they appear to go missing, appear correlated with the end of a network partition and the later recovery of the cluster to a healthy state.

latency-raw 83

For instance, in this test, element 427 is written successfully from 81.8 to 92.21 seconds, during which the nemesis heals a network partition. It is present in every subsequent read from 98.4 seconds to 141.1 seconds, when the cluster happens to be recovering from a network partition which ended about 15 seconds prior; it is missing from 142 seconds through the end of the test, at 185 seconds.

latency-raw 84

In this test, element 1063 is written successfully from 144.18 to 144.22 seconds, right when the cluster is recovering from a partition which ended ~15 seconds prior. Element 1063 is present immediately in subsequent reads at 144.24 seconds, and persists until 269 seconds, a read which completes immediately before the nemesis isolates n1 and n4 from n2, n3, and n5. The next read, at 303 seconds, fails to observe element 1063, and it remains missing for the remainder of the test. The final read occurs at 308 seconds.

latency-raw 85

In [this test](), element 5894 was inserted successfully from 300.55 to 300.57 seconds, just as the cluster is beginning to recover from a previous network partition. 5894 is present in subsequent reads at 304.2 seconds, and is last observed at 346.4 seconds, just prior to a partition isolating n3 and n5 from n1, n2, and n4. That partition ends at 406 seconds, and at 423 seconds, reads begin again--but none observe element 5894.

ttyusupov commented 5 years ago

Wasn't able to reproduce for 92 test runs with following settings:

GCP n1-standard-8 host with 5 LXC YugaByte nodes, OS: Debian Jessie
lein run test --os debian --version 1.1.10.0 --test set --time-limit 300 --concurrency 5 --nemesis partition-random-halves

Also wasn't able to reproduce for 20 test runs with the same settings on the following config:

Jepsen control node: GCP n1-standard-4 instance, OS: CentOS 7
5 YugaByte GCP n1-standard-2 instances, OS: Debian Jessie

But after I changed tserver flag load_balancer_max_concurrent_adds to 10 I was able to reproduce on both.

LXC:

results-sorted/ok/set/20190201T012307.000Z/jepsen.log
results-sorted/ok/set/20190201T012943.000Z/jepsen.log
results-sorted/ok/set/20190201T013558.000Z/jepsen.log
results-sorted/ok/set/20190201T014231.000Z/jepsen.log
results-sorted/ok/set/20190201T014852.000Z/jepsen.log
results-sorted/ok/set/20190201T015456.000Z/jepsen.log
results-sorted/ok/set/20190201T020130.000Z/jepsen.log
results-sorted/ok/set/20190201T020801.000Z/jepsen.log
**results-sorted/invalid/set/20190201T021432.000Z/jepsen.log**
results-sorted/ok/set/20190201T022101.000Z/jepsen.log
results-sorted/no-history/set/20190201T022744.000Z/jepsen.log
results-sorted/ok/set/20190201T022824.000Z/jepsen.log
results-sorted/ok/set/20190201T023454.000Z/jepsen.log
results-sorted/no-history/set/20190201T024054.000Z/jepsen.log
results-sorted/ok/set/20190201T024135.000Z/jepsen.log
results-sorted/ok/set/20190201T024758.000Z/jepsen.log
results-sorted/no-history/set/20190201T025427.000Z/jepsen.log
**results-sorted/invalid/set/20190201T025508.000Z/jepsen.log**
results-sorted/ok/set/20190201T030157.000Z/jepsen.log
results-sorted/ok/set/20190201T030820.000Z/jepsen.log
results-sorted/ok/set/20190201T031449.000Z/jepsen.log
results-sorted/ok/set/20190201T032111.000Z/jepsen.log
results-sorted/ok/set/20190201T032719.000Z/jepsen.log
results-sorted/ok/set/20190201T033342.000Z/jepsen.log
results-sorted/ok/set/20190201T034005.000Z/jepsen.log
results-sorted/ok/set/20190201T034641.000Z/jepsen.log
results-sorted/ok/set/20190201T035257.000Z/jepsen.log
**results-sorted/invalid/set/20190201T035926.000Z/jepsen.log**

GCP cluster:

results-sorted/ok/set/20190131T232727.000Z/jepsen.log
results-sorted/ok/set/20190131T233347.000Z/jepsen.log
results-sorted/ok/set/20190131T234015.000Z/jepsen.log
results-sorted/ok/set/20190131T234649.000Z/jepsen.log
results-sorted/ok/set/20190131T235301.000Z/jepsen.log
results-sorted/ok/set/20190131T235942.000Z/jepsen.log
results-sorted/ok/set/20190201T000648.000Z/jepsen.log
results-sorted/ok/set/20190201T001328.000Z/jepsen.log
results-sorted/ok/set/20190201T002005.000Z/jepsen.log
results-sorted/ok/set/20190201T002639.000Z/jepsen.log
results-sorted/ok/set/20190201T003306.000Z/jepsen.log
results-sorted/ok/set/20190201T003922.000Z/jepsen.log
results-sorted/ok/set/20190201T004558.000Z/jepsen.log
results-sorted/ok/set/20190201T005243.000Z/jepsen.log
**results-sorted/invalid/set/20190201T005914.000Z/jepsen.log**
results-sorted/ok/set/20190201T010536.000Z/jepsen.log
results-sorted/ok/set/20190201T011157.000Z/jepsen.log
results-sorted/ok/set/20190201T011828.000Z/jepsen.log
results-sorted/ok/set/20190201T012505.000Z/jepsen.log
results-sorted/ok/set/20190201T013139.000Z/jepsen.log
results-sorted/no-history/set/20190201T013800.000Z/jepsen.log
results-sorted/ok/set/20190201T013841.000Z/jepsen.log
results-sorted/ok/set/20190201T014454.000Z/jepsen.log
results-sorted/ok/set/20190201T015125.000Z/jepsen.log
results-sorted/ok/set/20190201T015802.000Z/jepsen.log
results-sorted/no-history/set/20190201T020423.000Z/jepsen.log
results-sorted/ok/set/20190201T020503.000Z/jepsen.log
results-sorted/ok/set/20190201T021134.000Z/jepsen.log
results-sorted/ok/set/20190201T021759.000Z/jepsen.log
results-sorted/ok/set/20190201T022355.000Z/jepsen.log
results-sorted/ok/set/20190201T023029.000Z/jepsen.log
results-sorted/ok/set/20190201T023709.000Z/jepsen.log
results-sorted/ok/set/20190201T024332.000Z/jepsen.log
results-sorted/no-history/set/20190201T024947.000Z/jepsen.log
results-sorted/ok/set/20190201T025028.000Z/jepsen.log
results-sorted/ok/set/20190201T025712.000Z/jepsen.log
results-sorted/ok/set/20190201T030322.000Z/jepsen.log
results-sorted/ok/set/20190201T030954.000Z/jepsen.log
results-sorted/ok/set/20190201T031634.000Z/jepsen.log
results-sorted/ok/set/20190201T032254.000Z/jepsen.log
**results-sorted/invalid/set/20190201T032923.000Z/jepsen.log**
results-sorted/ok/set/20190201T033545.000Z/jepsen.log
results-sorted/ok/set/20190201T034211.000Z/jepsen.log
results-sorted/ok/set/20190201T034836.000Z/jepsen.log
results-sorted/ok/set/20190201T035511.000Z/jepsen.log
results-sorted/ok/set/20190201T040134.000Z/jepsen.log
results-sorted/ok/set/20190201T040813.000Z/jepsen.log
results-sorted/ok/set/20190201T041448.000Z/jepsen.log
results-sorted/ok/set/20190201T042100.000Z/jepsen.log
results-sorted/ok/set/20190201T042725.000Z/jepsen.log
**results-sorted/invalid/set/20190201T043347.000Z/jepsen.log**
results-sorted/ok/set/20190201T044023.000Z/jepsen.log
results-sorted/ok/set/20190201T044704.000Z/jepsen.log
results-sorted/ok/set/20190201T045336.000Z/jepsen.log
results-sorted/ok/set/20190201T045954.000Z/jepsen.log
results-sorted/ok/set/20190201T050619.000Z/jepsen.log
results-sorted/ok/set/20190201T051242.000Z/jepsen.log
results-sorted/ok/set/20190201T051911.000Z/jepsen.log
results-sorted/ok/set/20190201T052553.000Z/jepsen.log
results-sorted/ok/set/20190201T053212.000Z/jepsen.log
results-sorted/ok/set/20190201T053843.000Z/jepsen.log
results-sorted/ok/set/20190201T054501.000Z/jepsen.log
results-sorted/ok/set/20190201T060216.000Z/jepsen.log
results-sorted/ok/set/20190201T060852.000Z/jepsen.log
results-sorted/ok/set/20190201T061536.000Z/jepsen.log
results-sorted/ok/set/20190201T062203.000Z/jepsen.log
results-sorted/ok/set/20190201T062846.000Z/jepsen.log
results-sorted/ok/set/20190201T063524.000Z/jepsen.log
results-sorted/ok/set/20190201T064205.000Z/jepsen.log
results-sorted/ok/set/20190201T064837.000Z/jepsen.log
results-sorted/ok/set/20190201T065512.000Z/jepsen.log
results-sorted/ok/set/20190201T070140.000Z/jepsen.log
**results-sorted/invalid/set/20190201T070741.000Z/jepsen.log**
results-sorted/ok/set/20190201T071339.000Z/jepsen.log
results-sorted/ok/set/20190201T072026.000Z/jepsen.log
results-sorted/ok/set/20190201T072704.000Z/jepsen.log
**results-sorted/invalid/set/20190201T073335.000Z/jepsen.log**
ttyusupov commented 5 years ago

Just invalid results schedule: LXC

8 test runs
results-sorted/invalid/set/20190201T021432.000Z/jepsen.log
8 test runs
results-sorted/invalid/set/20190201T025508.000Z/jepsen.log
9 test runs
results-sorted/invalid/set/20190201T035926.000Z/jepsen.log

GCP cluster:

13 test runs
results-sorted/invalid/set/20190201T005914.000Z/jepsen.log
25 test runs
results-sorted/invalid/set/20190201T032923.000Z/jepsen.log
9 test runs
results-sorted/invalid/set/20190201T043347.000Z/jepsen.log
21 test runs
results-sorted/invalid/set/20190201T070741.000Z/jepsen.log
3 test runs
results-sorted/invalid/set/20190201T073335.000Z/jepsen.log
kmuthukk commented 5 years ago

We have been able to repro this internally now, although still very very intermittently, by increasing load_balancer_max_concurrent_adds from 1 (default) to 10.

From the logs we have a theory on what's going on. This appears to happen when raft membership changes (such as during a load balancing operation) AND some other failure (such as a network partition) happen at the same time. Based on the logs, we have a theory on this, and are now in the process of implementing/validating the fix.

hectorgcr commented 5 years ago

When determining which OpIds have been replicated, we now only accept votes from peers that are VOTERS in the active config. Before this change we were accepting votes from any peer that was a member of the active config.