yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

Periodic issues with clusters not coming online #2183

Closed aphyr closed 2 years ago

aphyr commented 5 years ago

Jira Link: DB-1913

This has been frustrating me for weeks now, and I can't make any sense of it. When we run lots of Jepsen tests in a row, we go through periods where the cluster sets up just fine... and periods where it fails to even accept client connections after cluster join. What's weird is that it's periodic: for 10-30 minutes every cluster we set up fails to bind, then it magically starts working for the next 40-70 minutes. I feel like there has to be some state--additional processes, files, something--that we're not cleaning up between test runs, but it somehow resolves on its own after enough tries???

[Image: Screenshot from 2019-08-28 16-57-05]

I thought this might be linked to one particular type of test, but that doesn't seem to be the case--dropping that test from the rotation still results in periodic outages. It also doesn't seem linked to any particular nemesis; we see this occur with no nemesis at all, with killing tservers, or masters, or clock skew, etc.
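One way to test the leftover-state theory would be to probe the database ports on every node before setup (is anything still listening from the previous run?) and again after cluster join (how long until a client port accepts connections?). The following is only a rough sketch of that idea, not part of the Jepsen test suite. The helper names are hypothetical, and the port numbers are assumed to be YugabyteDB's defaults, so adjust them if the test overrides the bind addresses.

```python
import socket
import time

# Assumption: default YugabyteDB ports; change these if the cluster
# under test is configured with different bind addresses.
DEFAULT_PORTS = {
    "master-rpc": 7100,
    "tserver-rpc": 9100,
    "ysql": 5433,
    "ycql": 9042,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not accepting".
        return False


def lingering_listeners(hosts, ports=DEFAULT_PORTS):
    """Before setup: list (host, name, port) triples that still accept connections."""
    return [
        (host, name, port)
        for host in hosts
        for name, port in ports.items()
        if port_is_open(host, port)
    ]


def wait_for_port(host, port, deadline_s=60.0, interval_s=1.0):
    """After setup: poll until the port accepts connections.

    Returns the number of seconds waited, or None if the deadline passed.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if port_is_open(host, port):
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

Recording the output of `lingering_listeners` before each run, and the result of `wait_for_port` after cluster join, alongside the test timestamp would show whether the failing windows line up with stale processes still holding ports. A TCP probe only shows that something is listening; it does not show that the server can actually serve queries.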

The logs are kind of a mess--there are lots of errors, but my understanding is that many of them are expected or not dealbreakers. I'm not sure what I'm looking for exactly. I've attached several logs from some no-nemesis runs, all of which failed to set up, in the hopes they might be helpful.

20190828T174432.000Z.zip 20190828T175451.000Z.zip 20190828T174432.000Z.zip 20190828T175249.000Z.zip 20190828T175047.000Z.zip 20190828T174836.000Z.zip

rthallamko3 commented 2 years ago

@qvad, does this repro now? If not, can we resolve this issue?