Closed aphyr closed 5 years ago
Wasn't able to reproduce for 92 test runs with following settings:
GCP n1-standard-8 host with 5 LXC YugaByte nodes, OS: Debian Jessie
lein run test --os debian --version 1.1.10.0 --test set --time-limit 300 --concurrency 5 --nemesis partition-random-halves
Also wasn't able to reproduce for 20 test runs with the same settings on the following config:
Jepsen control node: GCP n1-standard-4 instance, OS: CentOS 7
5 YugaByte GCP n1-standard-2 instances, OS: Debian Jessie
But after I changed tserver flag load_balancer_max_concurrent_adds
to 10
I was able to reproduce on both.
LXC:
results-sorted/ok/set/20190201T012307.000Z/jepsen.log
results-sorted/ok/set/20190201T012943.000Z/jepsen.log
results-sorted/ok/set/20190201T013558.000Z/jepsen.log
results-sorted/ok/set/20190201T014231.000Z/jepsen.log
results-sorted/ok/set/20190201T014852.000Z/jepsen.log
results-sorted/ok/set/20190201T015456.000Z/jepsen.log
results-sorted/ok/set/20190201T020130.000Z/jepsen.log
results-sorted/ok/set/20190201T020801.000Z/jepsen.log
**results-sorted/invalid/set/20190201T021432.000Z/jepsen.log**
results-sorted/ok/set/20190201T022101.000Z/jepsen.log
results-sorted/no-history/set/20190201T022744.000Z/jepsen.log
results-sorted/ok/set/20190201T022824.000Z/jepsen.log
results-sorted/ok/set/20190201T023454.000Z/jepsen.log
results-sorted/no-history/set/20190201T024054.000Z/jepsen.log
results-sorted/ok/set/20190201T024135.000Z/jepsen.log
results-sorted/ok/set/20190201T024758.000Z/jepsen.log
results-sorted/no-history/set/20190201T025427.000Z/jepsen.log
**results-sorted/invalid/set/20190201T025508.000Z/jepsen.log**
results-sorted/ok/set/20190201T030157.000Z/jepsen.log
results-sorted/ok/set/20190201T030820.000Z/jepsen.log
results-sorted/ok/set/20190201T031449.000Z/jepsen.log
results-sorted/ok/set/20190201T032111.000Z/jepsen.log
results-sorted/ok/set/20190201T032719.000Z/jepsen.log
results-sorted/ok/set/20190201T033342.000Z/jepsen.log
results-sorted/ok/set/20190201T034005.000Z/jepsen.log
results-sorted/ok/set/20190201T034641.000Z/jepsen.log
results-sorted/ok/set/20190201T035257.000Z/jepsen.log
**results-sorted/invalid/set/20190201T035926.000Z/jepsen.log**
GCP cluster:
results-sorted/ok/set/20190131T232727.000Z/jepsen.log
results-sorted/ok/set/20190131T233347.000Z/jepsen.log
results-sorted/ok/set/20190131T234015.000Z/jepsen.log
results-sorted/ok/set/20190131T234649.000Z/jepsen.log
results-sorted/ok/set/20190131T235301.000Z/jepsen.log
results-sorted/ok/set/20190131T235942.000Z/jepsen.log
results-sorted/ok/set/20190201T000648.000Z/jepsen.log
results-sorted/ok/set/20190201T001328.000Z/jepsen.log
results-sorted/ok/set/20190201T002005.000Z/jepsen.log
results-sorted/ok/set/20190201T002639.000Z/jepsen.log
results-sorted/ok/set/20190201T003306.000Z/jepsen.log
results-sorted/ok/set/20190201T003922.000Z/jepsen.log
results-sorted/ok/set/20190201T004558.000Z/jepsen.log
results-sorted/ok/set/20190201T005243.000Z/jepsen.log
**results-sorted/invalid/set/20190201T005914.000Z/jepsen.log**
results-sorted/ok/set/20190201T010536.000Z/jepsen.log
results-sorted/ok/set/20190201T011157.000Z/jepsen.log
results-sorted/ok/set/20190201T011828.000Z/jepsen.log
results-sorted/ok/set/20190201T012505.000Z/jepsen.log
results-sorted/ok/set/20190201T013139.000Z/jepsen.log
results-sorted/no-history/set/20190201T013800.000Z/jepsen.log
results-sorted/ok/set/20190201T013841.000Z/jepsen.log
results-sorted/ok/set/20190201T014454.000Z/jepsen.log
results-sorted/ok/set/20190201T015125.000Z/jepsen.log
results-sorted/ok/set/20190201T015802.000Z/jepsen.log
results-sorted/no-history/set/20190201T020423.000Z/jepsen.log
results-sorted/ok/set/20190201T020503.000Z/jepsen.log
results-sorted/ok/set/20190201T021134.000Z/jepsen.log
results-sorted/ok/set/20190201T021759.000Z/jepsen.log
results-sorted/ok/set/20190201T022355.000Z/jepsen.log
results-sorted/ok/set/20190201T023029.000Z/jepsen.log
results-sorted/ok/set/20190201T023709.000Z/jepsen.log
results-sorted/ok/set/20190201T024332.000Z/jepsen.log
results-sorted/no-history/set/20190201T024947.000Z/jepsen.log
results-sorted/ok/set/20190201T025028.000Z/jepsen.log
results-sorted/ok/set/20190201T025712.000Z/jepsen.log
results-sorted/ok/set/20190201T030322.000Z/jepsen.log
results-sorted/ok/set/20190201T030954.000Z/jepsen.log
results-sorted/ok/set/20190201T031634.000Z/jepsen.log
results-sorted/ok/set/20190201T032254.000Z/jepsen.log
**results-sorted/invalid/set/20190201T032923.000Z/jepsen.log**
results-sorted/ok/set/20190201T033545.000Z/jepsen.log
results-sorted/ok/set/20190201T034211.000Z/jepsen.log
results-sorted/ok/set/20190201T034836.000Z/jepsen.log
results-sorted/ok/set/20190201T035511.000Z/jepsen.log
results-sorted/ok/set/20190201T040134.000Z/jepsen.log
results-sorted/ok/set/20190201T040813.000Z/jepsen.log
results-sorted/ok/set/20190201T041448.000Z/jepsen.log
results-sorted/ok/set/20190201T042100.000Z/jepsen.log
results-sorted/ok/set/20190201T042725.000Z/jepsen.log
**results-sorted/invalid/set/20190201T043347.000Z/jepsen.log**
results-sorted/ok/set/20190201T044023.000Z/jepsen.log
results-sorted/ok/set/20190201T044704.000Z/jepsen.log
results-sorted/ok/set/20190201T045336.000Z/jepsen.log
results-sorted/ok/set/20190201T045954.000Z/jepsen.log
results-sorted/ok/set/20190201T050619.000Z/jepsen.log
results-sorted/ok/set/20190201T051242.000Z/jepsen.log
results-sorted/ok/set/20190201T051911.000Z/jepsen.log
results-sorted/ok/set/20190201T052553.000Z/jepsen.log
results-sorted/ok/set/20190201T053212.000Z/jepsen.log
results-sorted/ok/set/20190201T053843.000Z/jepsen.log
results-sorted/ok/set/20190201T054501.000Z/jepsen.log
results-sorted/ok/set/20190201T060216.000Z/jepsen.log
results-sorted/ok/set/20190201T060852.000Z/jepsen.log
results-sorted/ok/set/20190201T061536.000Z/jepsen.log
results-sorted/ok/set/20190201T062203.000Z/jepsen.log
results-sorted/ok/set/20190201T062846.000Z/jepsen.log
results-sorted/ok/set/20190201T063524.000Z/jepsen.log
results-sorted/ok/set/20190201T064205.000Z/jepsen.log
results-sorted/ok/set/20190201T064837.000Z/jepsen.log
results-sorted/ok/set/20190201T065512.000Z/jepsen.log
results-sorted/ok/set/20190201T070140.000Z/jepsen.log
**results-sorted/invalid/set/20190201T070741.000Z/jepsen.log**
results-sorted/ok/set/20190201T071339.000Z/jepsen.log
results-sorted/ok/set/20190201T072026.000Z/jepsen.log
results-sorted/ok/set/20190201T072704.000Z/jepsen.log
**results-sorted/invalid/set/20190201T073335.000Z/jepsen.log**
Just invalid results schedule: LXC
8 test runs
results-sorted/invalid/set/20190201T021432.000Z/jepsen.log
8 test runs
results-sorted/invalid/set/20190201T025508.000Z/jepsen.log
9 test runs
results-sorted/invalid/set/20190201T035926.000Z/jepsen.log
GCP cluster:
13 test runs
results-sorted/invalid/set/20190201T005914.000Z/jepsen.log
25 test runs
results-sorted/invalid/set/20190201T032923.000Z/jepsen.log
9 test runs
results-sorted/invalid/set/20190201T043347.000Z/jepsen.log
21 test runs
results-sorted/invalid/set/20190201T070741.000Z/jepsen.log
3 test runs
results-sorted/invalid/set/20190201T073335.000Z/jepsen.log
We have been able to repro this internally now, although still very very intermittently, by increasing load_balancer_max_concurrent_adds
from 1 (default) to 10.
From the logs we have a theory on what's going on. This appears to happen when raft membership changes (such as during a load balancing operation) AND some other failure (such as a network partition) happen at the same time. Based on the logs, we have a theory on this, and are now in the process of implementing/validating the fix.
When determining which OpIds have been replicated, we now only accept votes from peers that are VOTERS in the active config. Before this change we were accepting votes from any peer that was a member of the active config.
In Jepsen set tests, on a five-node debian cluster running 1.1.10.0, we can observe the apparent loss of acknowledged writes after randomized 3/2 network partitions. Writes succeed, are visible to every subsequent read operation, then, after some critical juncture, are missing from every subsequent read, despite that read observing all other records written before and after the missing element. The insertion of these elements, and the time they appear to go missing, appear correlated with the end of a network partition and the later recovery of the cluster to a healthy state.
For instance, in this test, element 427 is written successfully from 81.8 to 92.21 seconds, during which the nemesis heals a network partition. It is present in every subsequent read from 98.4 seconds to 141.1 seconds, when the cluster happens to be recovering from a network partition which ended about 15 seconds prior; it is missing from 142 seconds through the end of the test, at 185 seconds.
In this test, element 1063 is written successfully from 144.18 to 144.22 seconds, right when the cluster is recovering from a partition which ended ~15 seconds prior. Element 1063 is present immediately in subsequent reads at 144.24 seconds, and persists until 269 seconds, a read which completes immediately before the nemesis isolates n1 and n4 from n2, n3, and n5. The next read, at 303 seconds, fails to observe element 1063, and it remains missing for the remainder of the test. The final read occurs at 308 seconds.
In [this test](), element 5894 was inserted successfully from 300.55 to 300.57 seconds, just as the cluster is beginning to recover from a previous network partition. 5894 is present in subsequent reads at 304.2 seconds, and is last observed at 346.4 seconds, just prior to a partition isolating n3 and n5 from n1, n2, and n4. That partition ends at 406 seconds, and at 423 seconds, reads begin again--but none observe element 5894.