Privacy for welcome messages

neekolas commented 8 months ago

I thought I would add some background to this issue on privacy in the XMTP network for context.

For us, the MLS Delivery Service is a fleet of XMTP nodes. Today, clients connect directly to a node. In the future, we expect clients to connect to a Gateway that may be run by third parties (for example, the developer of the app you are using). The Gateway would be responsible for connecting to the XMTP nodes on behalf of all of its clients.

Today, messages are stored in a relational database on the nodes with the following schema:

CREATE TABLE group_messages (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    group_id BYTEA NOT NULL,
    data BYTEA NOT NULL,
    group_id_data_hash BYTEA NOT NULL
);

CREATE INDEX idx_group_messages_group_id_created_at ON group_messages(group_id, created_at);
CREATE UNIQUE INDEX idx_group_messages_group_id_data_hash ON group_messages (group_id_data_hash);

CREATE TABLE welcome_messages (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    installation_key BYTEA NOT NULL, -- This is the recipient's installation_id
    data BYTEA NOT NULL,
    installation_key_data_hash BYTEA NOT NULL
);

CREATE INDEX idx_welcome_messages_installation_key_created_at ON welcome_messages(installation_key, created_at);
CREATE UNIQUE INDEX idx_welcome_messages_group_key_data_hash ON welcome_messages (installation_key_data_hash);

A single copy of each group message is stored on the nodes irrespective of group size. Welcome messages are stored with a copy for each recipient, so that anyone can query for the full list of Welcome messages sent to them.
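To make the storage asymmetry concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for the Postgres schema above. The helper names (`open_store`, `send_group_message`, `send_welcome`, `welcomes_for`) are hypothetical, and two things are assumptions rather than facts from the nodes: that the hash column is a SHA-256 digest over the key and the payload, and that duplicates are silently ignored.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    # SQLite stand-in for the schema above (BLOB instead of BYTEA,
    # timestamps and secondary indexes omitted for brevity).
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE group_messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            group_id BLOB NOT NULL,
            data BLOB NOT NULL,
            group_id_data_hash BLOB NOT NULL UNIQUE
        );
        CREATE TABLE welcome_messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            installation_key BLOB NOT NULL,
            data BLOB NOT NULL,
            installation_key_data_hash BLOB NOT NULL UNIQUE
        );
        """
    )
    return db


def _digest(key: bytes, data: bytes) -> bytes:
    # Assumption: the *_data_hash columns hash the key together with the payload.
    return hashlib.sha256(key + data).digest()


def send_group_message(db: sqlite3.Connection, group_id: bytes, data: bytes) -> None:
    # One row per message, no matter how many members the group has.
    db.execute(
        "INSERT OR IGNORE INTO group_messages (group_id, data, group_id_data_hash) "
        "VALUES (?, ?, ?)",
        (group_id, data, _digest(group_id, data)),
    )


def send_welcome(db: sqlite3.Connection, installation_key: bytes, data: bytes) -> None:
    # One row per recipient, so each installation can list its own welcomes.
    db.execute(
        "INSERT OR IGNORE INTO welcome_messages "
        "(installation_key, data, installation_key_data_hash) VALUES (?, ?, ?)",
        (installation_key, data, _digest(installation_key, data)),
    )


def welcomes_for(db: sqlite3.Connection, installation_key: bytes) -> list:
    rows = db.execute(
        "SELECT data FROM welcome_messages WHERE installation_key = ? ORDER BY id",
        (installation_key,),
    )
    return [row[0] for row in rows]
```

Inviting three installations to one group in this model produces three rows in `welcome_messages`, while every later message to that group adds a single row to `group_messages`.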

It's helpful to think of privacy separately for data at rest and data in transit.

At rest

XMTP nodes can be queried freely by the public. For example, anyone can connect to the StreamAllMessages endpoint of the nodes and receive a copy of all new messages sent on the network. They can also query any group_id and receive all group messages, and query for all welcome messages sent to an installation_id. We treat any data stored at rest as public and try to minimize the amount of unencrypted metadata stored to the bare minimum required for the network to function.

It is critical that any data at rest be of minimum utility to a passive attacker collecting all messages on the network. Knowing that N messages were sent to a group with random id abc123 is not very useful without additional context. There are millions of conversations on the XMTP network. But if the attacker were to know the membership of abc123, the sender IP or address of each individual message, or the type of content being sent, it could be used to de-anonymize usage of the network.

In transit

Communicating with the delivery service leaks different metadata. Requests to the nodes using MLS endpoints do not require any sort of client authentication token, but the IP of the sender is visible to the node. Each client IP will request data related to a specific set of installation_keys and group_ids. This can be used to identify group membership related to each IP and gives a strong hint as to the requester's installation_key.
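As a rough illustration of that leak, the sketch below shows how a node (or gateway) operator could build a per-IP profile from nothing more than its own request log. The `QueryLogEntry` record and `profile_clients` function are hypothetical and do not correspond to anything in the node codebase.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class QueryLogEntry:
    client_ip: str
    kind: str     # "group" for group_id queries, "welcome" for installation_key queries
    topic: bytes  # the group_id or installation_key that was requested


def profile_clients(log: Iterable[QueryLogEntry]) -> Dict[str, Dict[str, Set[bytes]]]:
    """Collect, per client IP, every group_id and installation_key it asked about."""
    profiles: Dict[str, Dict[str, Set[bytes]]] = {}
    for entry in log:
        profile = profiles.setdefault(
            entry.client_ip, {"groups": set(), "installation_keys": set()}
        )
        if entry.kind == "group":
            profile["groups"].add(entry.topic)
        elif entry.kind == "welcome":
            profile["installation_keys"].add(entry.topic)
    return profiles
```

An IP that repeatedly queries welcomes for a single installation_key is very likely that installation, and the group_ids it polls are very likely the groups it belongs to.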

Clients can choose to use Tor or other mixnet services to hide their originating IP address, or may decide that they trust the endpoint they are connecting to and choose not to use intermediaries to hide their IP. Because we expect many gateway nodes to be run by the developer of the client application, we hope some application developers will choose to differentiate themselves in the market through their privacy choices (such as not storing logs containing client IPs). Ultimately, we cannot enforce these choices, as the gateway deployer may freely modify our open source code.

Welcome message privacy

A passive attacker will know how many Welcome Messages have been sent to each identity on our network with the current architecture. This is currently required to efficiently index messages for clients and allow each client to know the list of conversations it has been invited to. We are actively researching alternative approaches that would allow us to keep the number of Welcome messages sent to each user private, but in the meantime our priority is to minimize the amount of associated metadata on these welcome messages.

We have identified the following set of potential threats to Welcome Message privacy:

- Group ID of conversation (P0, should already be hidden via MLS encryption)
  - While welcome messages are not sent at the exact same time as the commit, we may want to add an artificial and random delay to sending welcome messages to protect against timing attacks that correlate the commit and the welcome.
- Sender identity (P0, should already be hidden via MLS encryption)
- Other members invited to the group at the same time
  - This can be determined from the Welcome message payload today (P0, must fix before launch), since all KeyPackageRefs are stored in the unencrypted payload.
  - This can also be determined by timing attacks (P1 to resolve). If an attacker knows that N Welcome Messages were sent at the exact same time, they can estimate with high confidence that the messages were invitations to the same group.
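The timing threat and the random-delay idea can be sketched in a few lines. This is only an illustration under assumed parameters: the 50 ms clustering window, the uniform delay distribution, and the function names are all hypothetical, and a random delay lowers an attacker's confidence rather than eliminating the correlation.

```python
import random
from typing import List, Sequence, Tuple

Welcome = Tuple[float, str]  # (created_at in seconds, recipient installation_key)


def cluster_by_time(welcomes: Sequence[Welcome], window: float = 0.05) -> List[List[str]]:
    """Attacker's view: welcomes stored within `window` seconds of the
    previous one are guessed to be invitations to the same group."""
    clusters: List[List[Welcome]] = []
    for ts, key in sorted(welcomes):
        if clusters and ts - clusters[-1][-1][0] <= window:
            clusters[-1].append((ts, key))
        else:
            clusters.append([(ts, key)])
    return [[key for _, key in cluster] for cluster in clusters]


def with_jitter(
    welcomes: Sequence[Welcome], max_delay: float, rng: random.Random
) -> List[Welcome]:
    """Sender-side mitigation sketch: delay each welcome independently by a
    uniformly random amount in [0, max_delay] seconds."""
    return [(ts + rng.uniform(0.0, max_delay), key) for ts, key in welcomes]
```

Without jitter, three welcomes stored at the same instant fall into one cluster. With independent delays they tend to spread out, although on a quiet network an attacker may still be able to widen the window and recover the grouping.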

neekolas commented 8 months ago

Cryspen added the following context, outlining potential solutions:

From https://www.rfc-editor.org/rfc/rfc9420.html#section-12.4.3.1:

In order to allow the same Welcome message to be sent to multiple new members, information describing the group is encrypted with a symmetric key and nonce derived from the joiner_secret for the new epoch. The joiner_secret is then encrypted to each new member using HPKE. In the same encrypted package, the committer transmits the path secret for the lowest (closest to the leaf) node that is contained in the direct paths of both the committer and the new member. This allows the new member to compute private keys for nodes in its direct path that are being reset by the corresponding Commit.

Since a Welcome message is sent to several parties, even though the contents are encrypted it is trivial to see which parties were invited to the same group. XMTP asked us to look at whether this can be prevented.

We do stress, however, that MLS in general does not protect metadata like this. For example, nearly all protocol messages contain the group id, which also leaks group membership. In many instances the distribution services will also need to know who to forward messages to, so membership information is usually known there as well. Without a good understanding of the general system architecture, it is hard to give advice on the best way forward.

Keeping these caveats in mind, we investigated some options for addressing the issue of metadata leakage.

Commit to each Add separately, yielding different Welcome messages

This would be the easiest implementation-wise, but would introduce significant group churn for large invites. This proposal does not hide all the other kinds of metadata leaked by MLS.

Add a breaking protocol change

For example, instead of using welcome messages, create an UnlinkableWelcome message type that uses the init_key from the key package for encryption, instead of symmetrically encrypting with the joiner_secret. While this might be slightly cheaper than double-encryption, it will not make a very large difference, since the encryption based on the joiner_secret is symmetric (i.e. fast).

One reason against this path is that breaking protocol changes should generally be avoided.

This proposal does not hide all the other kinds of metadata leaked by MLS.

Add an outer layer of encryption

This would basically add an end-to-end encrypted message-passing channel between any two parties. This change would be entirely at the application layer, which means it is not a breaking change to the protocol. The double encryption will incur some cost, but given that the encryption in MLS is symmetric, this should not be prohibitive. The outer encryption could be done using a long-lived HPKE key pair per party.

This proposal would protect all metadata that MLS leaks, not just welcome messages.

neekolas commented 8 months ago

I think we've made some decisions here:

- We will take the "Add an outer layer of encryption" approach.
- The encryption will be done with HPKE using the init_key of the recipient's key package. The main reason for this choice is that it allows us to build on top of the same encryption primitives already used in MLS.
- We will need to include an identifier for the init_key on the welcome message, along with the ciphertext, so that the recipient can know which key to use to decrypt. This shouldn't leak any new information, since Welcomes are already indexed on the recipient's installation ID and the key packages are publicly available.
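A minimal sketch of how a recipient might use that identifier to pick the right key is below. Everything here is hypothetical: the `WelcomeEnvelope` layout, the `InitKeyStore` class, and the opaque `init_key_id` are illustrative names, not the libxmtp wire format. No HPKE is implemented; `hpke_open` is a caller-supplied function standing in for a real HPKE open operation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class WelcomeEnvelope:
    installation_key: bytes  # recipient index, already stored by the nodes today
    init_key_id: bytes       # hypothetical identifier for the recipient's init_key
    ciphertext: bytes        # outer-encrypted Welcome, opaque to the node


class InitKeyStore:
    """Recipient-side map from init_key identifier to the matching private key."""

    def __init__(self) -> None:
        self._private_keys: Dict[bytes, bytes] = {}

    def add(self, init_key_id: bytes, private_key: bytes) -> None:
        self._private_keys[init_key_id] = private_key

    def find(self, init_key_id: bytes) -> Optional[bytes]:
        return self._private_keys.get(init_key_id)


def open_welcome(
    envelope: WelcomeEnvelope,
    store: InitKeyStore,
    hpke_open: Callable[[bytes, bytes], bytes],
) -> Optional[bytes]:
    """Look up the private key named by the envelope and pass it, with the
    ciphertext, to the supplied HPKE open function. Returns None when the
    recipient no longer holds a key with that identifier."""
    private_key = store.find(envelope.init_key_id)
    if private_key is None:
        return None
    return hpke_open(private_key, envelope.ciphertext)
```

Keeping the identifier outside the ciphertext lets a client that has published several key packages avoid trial decryption with every private key it holds.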
