yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Tablet leader does not advance peer's state for next index after failed RBS #16149

Open arybochkin opened 1 year ago

arybochkin commented 1 year ago

Jira Link: DB-5585

Description

While investigating an RBS (remote bootstrap) failure, it was found that a tablet leader may get stuck with stale peer state and be unable to make progress on subsequent RBS attempts. More context:

1) The download phase of the RBS took more than 15 hours.
2) While the files were downloading, the tablet leader continued to receive other operations and hence kept generating WALs.
3) The RBS session timed out and was removed during post-RBS local bootstrapping, which led to the WALs being un-anchored and GCed.
4) The tablet peer was unable to catch up with the tablet leader because the WALs had been GCed. Say the tablet peer's latest known operation was OP_X, and the tablet leader kept OP_X+1 in TrackedPeer::next_index.
5) A new RBS was triggered for the same tablet peer, and all up-to-date data was downloaded successfully, including OP_X .. OP_X+N.
6) However, the tablet leader cannot send UpdateConsensus to make the peer a follower/voter, because it still thinks the peer's latest operation is OP_X.
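
The stuck state described in points 4 to 6 can be illustrated with a toy model. This is only a sketch, not YugabyteDB code: the `Leader` class, `first_retained_index`, `can_send_update`, `on_rbs_completed`, and `simulate` are hypothetical names, and the values 100 and 50 simply stand in for OP_X and N. `on_rbs_completed` shows one way the leader might advance `next_index` after a successful RBS; it is an assumption for illustration, not the actual fix.

```python
from dataclasses import dataclass


@dataclass
class TrackedPeer:
    """Toy stand-in for the leader-side view of a peer (hypothetical)."""
    next_index: int  # index of the next operation the leader will try to send


@dataclass
class Leader:
    """Toy stand-in for a tablet leader (hypothetical)."""
    first_retained_index: int  # oldest operation still present in the WAL
    last_index: int            # newest operation in the WAL

    def can_send_update(self, peer: TrackedPeer) -> bool:
        # The leader can only serve operations that were not GCed.
        # next_index == last_index + 1 means the peer is fully caught up.
        return self.first_retained_index <= peer.next_index <= self.last_index + 1

    def on_rbs_completed(self, peer: TrackedPeer, last_downloaded_index: int) -> None:
        # Assumed behaviour: move next_index past the data the peer
        # obtained through RBS, never moving it backwards.
        peer.next_index = max(peer.next_index, last_downloaded_index + 1)


def simulate(advance_after_rbs: bool) -> bool:
    """Return True if the leader can resume sending updates to the peer."""
    op_x, n = 100, 50
    # Point 4: the leader remembers OP_X+1, but the WAL was GCed past that point.
    peer = TrackedPeer(next_index=op_x + 1)
    leader = Leader(first_retained_index=op_x + 40, last_index=op_x + n)
    # Point 5: the second RBS delivers everything up to OP_X+N.
    if advance_after_rbs:
        leader.on_rbs_completed(peer, last_downloaded_index=op_x + n)
    # Point 6: without the advance, next_index still points at GCed operations.
    return leader.can_send_update(peer)
```

In this model, `simulate(False)` reproduces the reported behaviour (the leader cannot send anything), while `simulate(True)` lets the leader resume once `next_index` reflects the downloaded data.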

Please refer to the Jira item for more details.

bmatican commented 1 year ago

Passing to TK for triage.