Closed cameronvoell closed 4 weeks ago
This is a great catch and exactly the kind of thing we should be figuring out before launch. The error handling in process_own_message is faulty. In the case that merge_pending_commit fails, we do the right thing and clear the pending commit. In the case that ValidatedCommit::from_staged_commit fails we do the wrong thing and return an error without clearing the pending commit.
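To make the intended fix concrete, here is a minimal sketch, not the actual libxmtp code. MockGroup, ProcessError, validate_pending_commit and the fail_* flags are hypothetical stand-ins for the real group, error type, and ValidatedCommit::from_staged_commit. The only point it illustrates is that both failure paths clear the pending commit before returning the error.

```rust
// Hypothetical stand-ins: none of these are the real libxmtp or OpenMLS types.
#[derive(Debug, PartialEq)]
enum ProcessError {
    Validation(String),
    Merge(String),
}

#[derive(Default)]
struct MockGroup {
    pending_commit: Option<String>,
    epoch: u64,
    // Test knobs that force each step to fail.
    fail_validation: bool,
    fail_merge: bool,
}

impl MockGroup {
    // Stands in for ValidatedCommit::from_staged_commit.
    fn validate_pending_commit(&self) -> Result<(), ProcessError> {
        if self.fail_validation {
            return Err(ProcessError::Validation("permissions check failed".into()));
        }
        Ok(())
    }

    // Stands in for merge_pending_commit: applies the commit and advances the epoch.
    fn merge_pending_commit(&mut self) -> Result<(), ProcessError> {
        if self.fail_merge {
            return Err(ProcessError::Merge("merge failed".into()));
        }
        if self.pending_commit.take().is_some() {
            self.epoch += 1;
        }
        Ok(())
    }

    fn clear_pending_commit(&mut self) {
        self.pending_commit = None;
    }
}

// Both error paths clear the pending commit before the error is returned,
// so a later commit is not rejected because a stale one is still staged.
fn process_own_message(group: &mut MockGroup) -> Result<(), ProcessError> {
    if let Err(err) = group.validate_pending_commit() {
        group.clear_pending_commit();
        return Err(err);
    }
    if let Err(err) = group.merge_pending_commit() {
        group.clear_pending_commit();
        return Err(err);
    }
    Ok(())
}
```

This sketch treats every validation failure as non-retriable. If validation can also fail for retriable reasons, the clear should only happen on the non-retriable ones.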
@cameronvoell I think more generally we should do an audit of the error handling during syncs. Feels like a super high ROI place to invest before launch, since inconsistencies here can brick groups. Can you own that process?
Any error that is not going to succeed on retry (like a permissions check failing) needs to clear the staged commit if you are the sender. Then we can move on to the next message in the group safely. If you aren't the sender, I think we can just roll back the DB transaction and move on to the next message.
For errors that are retriable (network, storage) we probably want to keep trying indefinitely so that a 5 minute network outage doesn't brick groups. It's better to have syncs fail for a period of time than to ignore a message that other group members might treat as valid.
There's obviously more than just that. Feels like a short doc or a long code comment would go a long way in defining the expected behaviour.
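As a starting point for that doc, the policy above could be written down as a small decision function. This is only a sketch: SyncFailure, NextStep, is_retryable and next_step are made-up names, and the split of errors into retriable and non-retriable is an assumption based on the examples given (network and storage versus a permissions check).

```rust
// Hypothetical error categories; the real sync errors are more fine-grained.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyncFailure {
    Network,
    Storage,
    PermissionDenied,
    InvalidCommit,
}

// What the sync loop should do after a message fails to process.
#[derive(Debug, PartialEq)]
enum NextStep {
    // Keep retrying; do not skip the message.
    RetryLater,
    // We sent the commit: clear the staged commit, then move to the next message.
    ClearPendingCommitAndSkip,
    // Someone else sent it: roll back the DB transaction, then move on.
    RollbackAndSkip,
}

// Assumption: only network and storage failures can succeed on a later attempt.
fn is_retryable(failure: SyncFailure) -> bool {
    matches!(failure, SyncFailure::Network | SyncFailure::Storage)
}

fn next_step(failure: SyncFailure, is_sender: bool) -> NextStep {
    if is_retryable(failure) {
        NextStep::RetryLater
    } else if is_sender {
        NextStep::ClearPendingCommitAndSkip
    } else {
        NextStep::RollbackAndSkip
    }
}
```

Keeping the classification in one place means each new error variant has to be assigned a behaviour explicitly, instead of inheriting whatever the nearest call site happens to do.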
This is a great catch and exactly the kind of thing we should be figuring out before launch. The error handling in process_own_message is faulty. In the case that merge_pending_commit fails, we do the right thing and clear the pending commit. In the case that ValidatedCommit::from_staged_commit fails we do the wrong thing and return an error without clearing the pending commit.
@neekolas I think you're right that this is where it's breaking, though it looks like adding a call to clear_pending_commit in process_own_message did not solve it for me yet (tested with commit here).
I think I'm on the right track though. The logs suggest the pending commit is coming back after I clear it. I suspect the associated intent is not being deleted and an intent retry is bringing it back. I'll continue on that tomorrow.
I think more generally we should do an audit of the error handling during syncs. Feels like a super high ROI place to invest before launch, since inconsistencies here can brick groups. Can you own that process?
Yep, I can own that. 👌 I agree that this sync error handling is a great spot to focus on. As I wrap up this bug, I'll work on a doc that gives an overview of error handling in sync, and post it to the team for feedback. I can have an early version of that Tues or Wed of this week 👍
Describe the bug
Error is
Sync([CreateGroupContextExtProposalError(MlsGroupStateError(PendingCommit))])
See reproducing test below
https://github.com/xmtp/libxmtp/compare/cv/metadata-update-fails-after-failed-commit?expand=1