Bug: After a metadata update fails as expected for insufficient permissions, subsequent valid updates also fail for that client

Describe the bug

Error is Sync([CreateGroupContextExtProposalError(MlsGroupStateError(PendingCommit))])

See reproducing test below

https://github.com/xmtp/libxmtp/compare/cv/metadata-update-fails-after-failed-commit?expand=1

#[tokio::test]
    async fn test_can_update_gce_after_failed_commit() {
        // Step 1: Amal creates a group
        let amal = ClientBuilder::new_test_client(&generate_local_wallet()).await;
        let policies = Some(PreconfiguredPolicies::AllMembers);
        let amal_group = amal.create_group(policies).unwrap();
        amal_group.sync(&amal).await.unwrap();

        // Step 2:  Amal adds Bola to the group
        let bola = ClientBuilder::new_test_client(&generate_local_wallet()).await;
        amal_group.add_members_by_inbox_id(&amal, vec![bola.inbox_id()]).await.unwrap();

        // Step 3: Verify that Bola can update the group name, and amal sees the update
        bola.sync_welcomes().await.unwrap();
        let bola_groups = bola.find_groups(None, None, None, None).unwrap();
        let bola_group: &MlsGroup = bola_groups.first().unwrap();
        bola_group.sync(&bola).await.unwrap();
        bola_group.update_group_name(&bola, "Name Update 1".to_string()).await.unwrap();
        amal_group.sync(&amal).await.unwrap();
        let name = amal_group.group_name().unwrap();
        assert_eq!(name, "Name Update 1");

        // Step 4:  Bola attempts an action that they do not have permissions for like add admin, fails as expected
        let result = bola_group.update_admin_list(&bola, UpdateAdminListType::Add, bola.inbox_id()).await;
        if let Err(e) = &result {
            eprintln!("Error updating admin list: {:?}", e);
        }
        // Step 5: Now have Bola attempt to update the group name again => It is failing for some reason
        let result = bola_group.update_group_name(&bola, "Name Update 2".to_string()).await;
        if let Err(e) = &result {
            eprintln!("FAILING WHEN UPDATING GROUP NAME AFTER FAILED UPDATE COMMIT: {:?}", e);
        }

        // Step 6: Verify that the group name has been updated
        amal_group.sync(&amal).await.unwrap();
        let binding = amal_group.mutable_metadata().expect("msg");
        let amal_group_name: &String = binding
            .attributes
            .get(&MetadataField::GroupName.to_string())
            .unwrap();
        assert_eq!(amal_group_name, "Name Update 2"); // <= Currently failing because error above on second name update is unexpectedly showing up

        // If you comment out step 4, the test passes
    }

Expected behavior

No response

Steps to reproduce the bug

No response

This is a great catch and exactly the kind of thing we should be figuring out before launch. The error handling in process_own_message is faulty. In the case that merge_pending_commit fails, we do the right thing and clear the pending commit. In the case that ValidatedCommit::from_staged_commit fails we do the wrong thing and return an error without clearing the pending commit.

@cameronvoell I think more generally we should do an audit of the error handling during syncs. Feels like a super high ROI place to invest before launch, since inconsistencies here can brick groups. Can you own that process?

Any error that is not going to succeed on retry (like a permissions check failing) needs to clear the staged commit if you are the sender. Then we can move on to the next message in the group safely. If you aren't the sender, I think we can just roll back the DB transaction and move on to the next message.

For errors that are retriable (network, storage) we probably want to keep trying indefinitely so that a 5 minute network outage doesn't brick groups. It's better to have syncs fail for a period of time than to ignore a message that other group members might treat as valid.

There's obviously more than just that. Feels like a short doc or a long code comment would go a long way in defining the expected behaviour.

This is a great catch and exactly the kind of thing we should be figuring out before launch. The error handling in process_own_message is faulty. In the case that merge_pending_commit fails, we do the right thing and clear the pending commit. In the case that ValidatedCommit::from_staged_commit fails we do the wrong thing and return an error without clearing the pending commit.

@neekolas I think you're right this is where it is breaking. Though looks like a call to clear_pending_commit in process_own_messages did not solve it yet for me (tested with commit here)

I think I'm on the right track though. Logs look like the pending commit is coming back after I clear it, I'm thinking that the associated intent is not being deleted, and some intent retry is possibly bringing it back. Will continue on that tomorrow.

I think more generally we should do an audit of the error handling during syncs. Feels like a super high ROI place to invest before launch, since inconsistencies here can brick groups. Can you own that process?

Yep I can own that. 👌 I agree that this sync error handling is a great spot to focus on. As I wrap up this bug, I'll work on a doc that gives an overview on error handling in sync, and post it to the team for feedback. Can have an early version of that tues or wed of this week 👍

xmtp / libxmtp