xmtp / xmtp-node-go

Software for the nodes that currently form the XMTP network
MIT License

Feature request: MLS Delivery Service Endpoints #310

Closed: neekolas closed this issue 11 months ago

neekolas commented 12 months ago

Is your feature request related to a problem?

MLS messages have strict rules around payload ordering, and require specialized endpoints for Key Package distribution that ensure a Key Package is only used once.

A normal MLS delivery service keeps state tracking the current epoch for each group and ensures that the epoch in the unencrypted envelope of a message is not less than the current epoch. If it is less than the current epoch, the service must return an error, which instructs the client to refresh the state of the group and try again.
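
For illustration, a minimal Go sketch of that ordering rule, assuming an in-memory map of current epochs. The real service would keep this state in a shared database, and the names here are placeholders; how the stored epoch advances is also an assumption:

package mls

import "errors"

// ErrStaleEpoch is a hypothetical error returned when a client publishes a
// message from an epoch older than the one the delivery service has recorded.
var ErrStaleEpoch = errors.New("message epoch is older than the group's current epoch")

// validateEpoch enforces the ordering rule described above: reject any message
// whose epoch is less than the current epoch recorded for the group.
func validateEpoch(currentEpochs map[string]uint64, groupID string, msgEpoch uint64) error {
    current, seen := currentEpochs[groupID]
    if seen && msgEpoch < current {
        // Tells the client to refresh its group state and try again.
        return ErrStaleEpoch
    }
    if !seen || msgEpoch > current {
        currentEpochs[groupID] = msgEpoch
    }
    return nil
}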

Messages returned from the delivery service should be returned in a consistent order that cannot be spoofed by the client.

Describe the solution to the problem

The goal of this API is to simplify the development of MLS clients using XMTP by building a stateful MLS Delivery Service in front of our network. The primary value-add of this API is being able to validate data before storage, removing the need for conflict resolution on the client.

Data Storage

Group messages and welcome messages would continue to be stored on the XMTP network in topics under the /xmtp/3 prefix. No other publishing would be allowed to these topics. Each group would have a group_topic for group messages (Commit, Proposal, Application Message, etc), and each installation would have a welcome_topic to store its welcome messages. All group messages would be validated before they are written to a topic. Because Welcome Messages will be fully encrypted with SealedSender, those can continue to use the regular Publish endpoint and will need to be validated on the client.
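
As a rough sketch of that layout, the topic names for a group and an installation might be derived like this in Go. Only the /xmtp/3 prefix comes from the proposal; the g-/w- suffix format is invented for illustration:

package mls

import "fmt"

// Hypothetical topic naming under the /xmtp/3 prefix. Only the prefix is
// specified above; the g-/w- suffixes are placeholders.
func groupTopic(groupID string) string {
    return fmt.Sprintf("/xmtp/3/g-%s", groupID)
}

func welcomeTopic(installationID string) string {
    return fmt.Sprintf("/xmtp/3/w-%s", installationID)
}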

Nodes would have a shared database table that keeps track of the current epoch for each group_id.
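
A possible shape for that shared table, sketched as SQL held in a Go constant; the table and column names are placeholders rather than a committed schema:

package mls

// createGroupEpochsTable is a hypothetical migration for the shared table
// described above: one row per group_id, holding the highest epoch the
// delivery service has accepted for that group.
const createGroupEpochsTable = `
CREATE TABLE IF NOT EXISTS group_epochs (
    group_id      BYTEA PRIMARY KEY,
    current_epoch BIGINT NOT NULL DEFAULT 0,
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
`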

All messages stored on the XMTP network under the /xmtp/3 prefix would use the server timestamp as both sender_time and receiver_time on messages. Clients would not be able to set message timestamps manually. Additionally, nodes should synchronize their clocks using NTP to minimize clock drift.
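
A minimal sketch of that rule in Go, using a simplified stand-in for the stored envelope type: both timestamps come from the node's clock, and anything the client supplied is ignored:

package mls

import "time"

// envelope is a simplified stand-in for the stored message type; only the
// fields relevant to the timestamp rule are shown.
type envelope struct {
    SenderTimeNs   uint64
    ReceiverTimeNs uint64
    Payload        []byte
}

// stampEnvelope applies the rule above: both timestamps come from the node's
// own clock (kept in sync via NTP), never from the client.
func stampEnvelope(env *envelope) {
    now := uint64(time.Now().UnixNano())
    env.SenderTimeNs = now
    env.ReceiverTimeNs = now
}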

While it may be possible to store Key Packages on the network, the fact that they must be deleted after consumption makes our current immutable publishing architecture less ideal there. I propose we store Key Packages in a separate database table on our nodes until the decentralized Contact Directory and Key Package service are ready for use. We can then have the same API route requests to those services and return identical responses.
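
Sketching what that interim node-side table might look like (again as SQL in a Go constant, with invented names): ordinary key packages would be deleted or marked consumed after use, while the most recent last-resort package per installation is kept and reused:

package mls

// createKeyPackagesTable is a hypothetical schema for the interim node-side
// key package store described above.
const createKeyPackagesTable = `
CREATE TABLE IF NOT EXISTS key_packages (
    id              BIGSERIAL PRIMARY KEY,
    installation_id TEXT NOT NULL,
    key_package     BYTEA NOT NULL,
    is_last_resort  BOOLEAN NOT NULL DEFAULT FALSE,
    consumed_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS key_packages_by_installation
    ON key_packages (installation_id, is_last_resort, created_at);
`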

Clients are expected to use the regular Query, BatchQuery, and Subscribe endpoints to read messages from the network.

Message validation

Because message contents are end-to-end encrypted, there are limits to what we can validate on the server. The only fields available are the group_id and epoch, and any authentication token we include with requests. This allows us to validate that the message was sent from the current epoch, but we cannot validate that a commit was actually sent by a member of the group. This leaves open the potential for a client outside of a group to send a commit message that is accepted by the server.

Clients should advance their own epoch even for messages that otherwise fail validation. Empty epochs would need to be allowed.

API Spec

This is all pretty preliminary, but should be an indication of what we need to accelerate the development of MLS clients.

// RPCs for the new MLS API
service MlsApi {
  // Publish an MLS payload, which would be validated before being stored on
  // the network
  rpc MlsPublish(MlsPublishRequest) returns (google.protobuf.Empty) {}

  // Upload one or more Key Packages, which would be validated before storage
  rpc UploadKeyPackages(UploadKeyPackagesRequest) returns (google.protobuf.Empty) {}

  // Get one or more Key Packages by installation_id
  rpc GetKeyPackages(GetKeyPackagesRequest) returns (GetKeyPackagesResponse) {}

  // Would delete all key packages associated with the installation and mark
  // the installation as having been revoked
  rpc RevokeInstallation(RevokeInstallationRequest) returns (google.protobuf.Empty) {}

  // Used to check for changes related to members of a group.
  // Would return an array of any new installations associated with the wallet
  // address, and any revocations that have happened.
  rpc GetIdentityUpdates(GetIdentityUpdatesRequest) returns (GetIdentityUpdatesResponse) {}
}

message MlsPublishRequest {
  // This would be a serialized MLS message that the node would
  // parse and extract the group_id and epoch from
  // If the epoch is less than the node's state for that group, it would return
  // an error.
  bytes mls_message = 1;
}

message MlsPublishResponse {}

message UploadKeyPackagesRequest {
  message KeyPackageUpload {
    // This would be a serialized MLS key package that the node would
    // parse, validate, and then store.

    // The owner's wallet address would be extracted from the identity
    // credential in the key package, and all signatures would be validated.
    bytes key_package = 1;
    // The node will always treat the most recent last-resort key package as
    // the active one, and will ignore all others.
    bool is_last_resort = 2;
  }
  repeated KeyPackageUpload key_packages = 1;
}

message UploadKeyPackagesResponse {}

message GetKeyPackagesRequest {
  // The caller can provide an array of installation_ids, and the API
  // will consume one key package for each installation.
  // Once consumed, a regular key package cannot be used again.
  // If no key packages remain for the installation, the "last resort" key package may be returned
  repeated string installation_ids = 1;
}

message GetKeyPackagesResponse {
  message KeyPackage {
    bytes key_package = 1;
    bool is_last_resort = 2;
  }

  // Returns one key package per installation in the original order of the
  // request. If any installations are missing key packages, the corresponding
  // entries may be empty.
  repeated KeyPackage key_packages = 1;
}

message RevokeInstallationRequest {
  string installation_id = 1;
  // All revocations must be validated with a wallet signature over the
  // installation_id being revoked (and some sort of standard prologue)
  Signature wallet_signature = 2;
}

message GetIdentityUpdatesRequest {
  repeated string wallet_addresses = 1;
  uint64 start_time_ns = 2;
}

message GetIdentityUpdatesResponse {
  message IdentityUpdate {
    repeated string new_installation_ids = 1;
    repeated string revoked_installation_ids = 2;
  }

  // A list of updates (or empty objects if no changes) in the original order
  // of the request
  repeated IdentityUpdate updates = 1;
}
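
As a sanity check on the shapes above, here is a hypothetical Go caller against stubs generated from this draft. The mlsv1 import path, client constructor, and field casing are assumptions about what protoc would emit, not real generated code:

package main

import (
    "context"
    "fmt"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    mlsv1 "example.com/placeholder/mls/v1" // hypothetical stubs generated from the draft above
)

// serializedCommit stands in for a commit message produced by the client's MLS stack.
var serializedCommit []byte

func main() {
    conn, err := grpc.Dial("localhost:5556", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := mlsv1.NewMlsApiClient(conn)
    ctx := context.Background()

    // Publish a serialized MLS message; the node parses out the group_id and
    // epoch and rejects the payload if the epoch is stale.
    if _, err := client.MlsPublish(ctx, &mlsv1.MlsPublishRequest{MlsMessage: serializedCommit}); err != nil {
        log.Fatalf("publish rejected: %v", err)
    }

    // Fetch (and consume) one key package per installation we want to add.
    resp, err := client.GetKeyPackages(ctx, &mlsv1.GetKeyPackagesRequest{
        InstallationIds: []string{"installation-a", "installation-b"},
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("key packages returned:", len(resp.KeyPackages))
}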

Describe the use cases for the feature

No response

Additional details

One of the most annoying implementation challenges is probably the most mundane. OpenMLS serializes all messages using a customized TLS codec. To read these messages in Go, we can either:

  1. Change the serialization in OpenMLS to something more standard
  2. Implement the TLS codec in golang and make structs that match OpenMLS data structures (I did a quick search and couldn't really find anything)
  3. Make our Go code call into a Rust library to handle deserialization and validation of messages

I think I'm in favour of option 3, but could be convinced to go another way.

richardhuaaa commented 12 months ago

All messages stored on the XMTP network under the /xmtp/3 prefix would use the server timestamp as both sender_time and receiver_time on messages

  1. What is the purpose of having both a sender_time and receiver_time, rather than a single field?
  2. Could we consider populating two fields, client_time and server_time, and only using the server_time today, but leaving open the option of using client_time in the future?
richardhuaaa commented 12 months ago

There is a RevokeInstallation endpoint but not a GrantInstallation endpoint - is this deliberate? Note that Keypackages have an N:1 relationship to installations, so I don't think it's wise to use keypackages as a proxy for fetching the installation list for a given identity.

neekolas commented 12 months ago

What is the purpose of having both a sender_time and receiver_time, rather than a single field?

Basically just to maintain compatibility with the existing message storage. The nodes today support both, which is a holdover from Waku, where nodes don't know if they are receiving messages from clients or getting them relayed from other nodes. One of the goals here is to be able to still leverage the existing Query, Subscribe, and BatchQuery endpoints, which all sort based on sender_timestamp today.

There is a RevokeInstallation endpoint but not a GrantInstallation endpoint - is this deliberate? Note that Keypackages have an N:1 relationship to installations, so I don't think it's wise to use keypackages as a proxy for fetching the installation list for a given identity.

I was kinda imagining that the first key package uploaded would create the installation implicitly, but I'm open to making it explicit.

I was also thinking about using the GetIdentityUpdates with a start_time_ns of 0 as a way to get all installations for a wallet. There's probably some more design that needs to go into the request and response shape of that endpoint though.
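
For example, that "everything since time zero" pattern could look roughly like this against the same hypothetical mlsv1 stubs, folding revocations back out of the result:

package mls

import (
    "context"

    mlsv1 "example.com/placeholder/mls/v1" // same hypothetical stubs as the sketch above
)

// listInstallations sketches the pattern described above: calling
// GetIdentityUpdates with start_time_ns = 0 returns every identity update for
// the wallet, from which the current installation set can be reconstructed by
// applying revocations on top of new installations.
func listInstallations(ctx context.Context, client mlsv1.MlsApiClient, wallet string) ([]string, error) {
    resp, err := client.GetIdentityUpdates(ctx, &mlsv1.GetIdentityUpdatesRequest{
        WalletAddresses: []string{wallet},
        StartTimeNs:     0, // 0 = all updates since the beginning of time
    })
    if err != nil {
        return nil, err
    }

    active := map[string]bool{}
    for _, update := range resp.Updates {
        for _, id := range update.NewInstallationIds {
            active[id] = true
        }
        for _, id := range update.RevokedInstallationIds {
            delete(active, id)
        }
    }

    ids := make([]string, 0, len(active))
    for id := range active {
        ids = append(ids, id)
    }
    return ids, nil
}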

neekolas commented 11 months ago

I've been digging into the best way for the Go service to talk to Rust. I initially thought we'd just need to deserialize the OpenMLS payloads, but I've come to realize we really need to be able to talk to an instance of OpenMLS to properly validate things.

There are three real options I can see here:

  1. We write a small Rust library for handling these narrow use-cases and some C bindings for it. Use Go's C FFI support to call it from inside our Node
  2. Write the entire Delivery Service API in Rust and serve it separately. The Rust Delivery Service would have a database to store state (current epochs for each group ID, key packages), and would talk to the main API to publish messages that pass validation.
  3. Have a small sidecar service written in Rust that lives next to our nodes to handle parsing and validation of requests. Communicate back and forth via API.

I was initially attracted to 1, since it leaves us with a single service and no new infra. But the more I look at it, the more of a pain it feels like. It's a complicated build process where you have to create a static or dynamic library and some C headers, then compile your Go program with the correct LD_LIBRARY_PATH to find the library. Doing something crazy like passing a slice back and forth requires scary unsafe Rust code and manual freeing of memory, and error handling is complicated. If we had any bugs, there's a risk of SEGFAULTing our nodes. Plus, it's another complicated multi-repo build process that would be a pain to make changes to. Possible, but a fair bit of work and maintenance burden.

2 is also more annoying than it looks at first blush. In addition to actually writing the service, which I don't think would be that bad, there's some hidden work. We'd need to hook it up to gRPC-Gateway via a sidecar container so that we could get a JSON API for requests coming from the browser. We'd need to get it deployed somewhere accessible to the public, with proper monitoring and metrics. And we'd need to build some special authentication into our nodes so that only the service could publish to V3 topics. Plus, writing the service in Rust is going to be slower than building on top of our existing Go service.

I'm leaning towards option 3. It leaves us with a very small Rust service that can be deployed as a Docker container and tested properly on its own. The vast majority of the delivery service code stays in Go. It lets us pass around complex request/response types easily, and encapsulates errors in a separate service with its own memory. No complicated build steps or FFI. Just generally feels like the most maintainable and least risky option.
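
To make option 3 a bit more concrete, here is a rough sketch of the Go node's side of that boundary. The sidecar URL, route, and response shape are all invented for illustration; the actual interface (HTTP vs gRPC, fields returned) is still open:

package mls

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// validationResult is a hypothetical response shape from the Rust sidecar:
// it hands back the fields the Go node needs (group_id, epoch) after OpenMLS
// has parsed and validated the payload.
type validationResult struct {
    GroupID string `json:"group_id"`
    Epoch   uint64 `json:"epoch"`
}

// validateWithSidecar posts a serialized MLS message to the sidecar and
// returns the extracted group_id and epoch. Everything here (URL, route,
// JSON shape) is a placeholder, not a real interface.
func validateWithSidecar(ctx context.Context, sidecarURL string, mlsMessage []byte) (*validationResult, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, sidecarURL+"/validate", bytes.NewReader(mlsMessage))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/octet-stream")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("sidecar rejected message: %s", resp.Status)
    }

    var result validationResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, err
    }
    return &result, nil
}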

Any objections, @richardhuaaa @snormore?

snormore commented 11 months ago

(3) makes sense to me :+1: