[DocDB] Add observability in raft replication

yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.

https://www.yugabyte.com

[DocDB] Add observability in raft replication #16812

Open robertsami opened 1 year ago

robertsami commented 1 year ago

Jira Link: DB-6159

Description

We have performing_update_mutex_ and peer_lock_ in consensus_peers.cc. If either is held for too long, Raft heartbeats go unserved, which causes leadership loss and re-elections. We should explicitly log whenever these locks are held too long, so that it is clear this is what is happening when we see rapid or intermittent leadership loss.

ajd12342 commented 10 months ago

Hi @robertsami! I am Anuj Diwan, a Computer Science PhD student at UT Austin. I am on a team with @arjunrs1 (Arjun Somayazulu), and we are taking a graduate Distributed Systems course. For our course project, we are interested in contributing to Yugabyte, and this issue relates to our course material. Could we work on it? Any pointers to help us get started would be appreciated as well.

Thanks and regards, Anuj.

rthallamko3 commented 10 months ago

@ajd12342, refer to the Yugabyte contribution page for instructions: https://docs.yugabyte.com/preview/contribute/core-database/checklist/ Feel free to take up this issue. cc @Huqicheng in case you need references for this particular issue. If you get past the initial aspects, we can identify a couple more items in this area.

ajd12342 commented 10 months ago

@Huqicheng Thanks, we will take this issue up. Please feel free to assign it to us.

Our initial idea is to add a field to the lock that records the timestamp at which it was locked (which naturally gets reset on each unlock and re-lock). Then, whenever someone queries the state of the lock or tries to acquire it, we can compare the current time to this timestamp to decide whether to log that the lock has been held too long. Is this reasonable?

ajd12342 commented 10 months ago

Hi @rthallamko3, could you assign this one to us if all looks good?

Huqicheng commented 10 months ago

@ajd12342 Sounds good. Is it possible to implement a wrapper around the lock instead of adding the field directly to the current lock?