xmtp / libxmtp


Group limit benchmarks #793

Status: Closed (insipx closed this 3 weeks ago)

insipx commented 1 month ago

local gRPC benchmarks: benchmark report

Benchmarks against dev network: report

Message Size Limits

Issue: #812

This isn't exact, but we eventually reach the gRPC message size limit. For this PR, I added chunking to the send_welcomes fn, which solves this, but we're still sending lots of data over the wire and encrypting a large welcome message, and very large group sizes will still hit the limit (it would require the ratchet tree itself to grow larger than the gRPC limit).

```
Benchmarking add_to_empty_group/500: Warming up for 3.0000 s
Adding 500 members
2024-05-31T02:23:04.084639Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:04.084662Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
2024-05-31T02:23:05.355537Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:05.355565Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
2024-05-31T02:23:06.580907Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:06.580934Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
thread 'main' panicked at xmtp_mls/benches/group_limit_empty.rs:59:65:
called `Result::unwrap()` on an `Err` value: Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
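The chunking approach described above can be sketched roughly as follows. This is an illustrative standalone example, not the actual libxmtp code: `chunk_welcomes` and `GRPC_PAYLOAD_LIMIT` are hypothetical names, and the real implementation works on protobuf messages rather than raw byte vectors.

```rust
// Hypothetical sketch of the chunking idea behind send_welcomes:
// split a batch of serialized welcome payloads into chunks whose
// combined size stays under a gRPC message-size budget.
const GRPC_PAYLOAD_LIMIT: usize = 50 * 1024 * 1024; // just under the 52428800-byte limit in the logs

fn chunk_welcomes(welcomes: &[Vec<u8>], limit: usize) -> Vec<Vec<Vec<u8>>> {
    let mut chunks: Vec<Vec<Vec<u8>>> = Vec::new();
    let mut current: Vec<Vec<u8>> = Vec::new();
    let mut current_size = 0usize;

    for welcome in welcomes {
        // start a new chunk if adding this welcome would exceed the budget
        if !current.is_empty() && current_size + welcome.len() > limit {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += welcome.len();
        current.push(welcome.clone());
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // three 30-byte payloads with a 70-byte budget -> two chunks (60 + 30)
    let welcomes = vec![vec![0u8; 30]; 3];
    let chunks = chunk_welcomes(&welcomes, 70);
    assert_eq!(chunks.len(), 2);
    assert!(chunks
        .iter()
        .all(|c| c.iter().map(Vec::len).sum::<usize>() <= 70));
    println!("chunks: {}", chunks.len());
    let _ = GRPC_PAYLOAD_LIMIT;
}
```

Note this only caps each request's payload: a single welcome larger than the budget (e.g. a ratchet tree bigger than the gRPC limit) still cannot be sent, which is the residual failure mode mentioned above.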

Flamegraphs (all flamegraphs are local gRPC)

- CPU flamegraph for member addition: flamegraph (interactive flamegraph link)
- Add to empty group: tracing-flamegraph (interactive link)
- Add to 100 member group: tracing-flamegraph (interactive)
- Remove all members: tracing-flamegraph (interactive)
- Add 1 member to group: tracing-flamegraph (interactive)

Bugs

Verdict

It's difficult to recommend increasing the group size much above 400 if it takes 3-4+ seconds to add all those members, although this also depends on how long users would actually be willing to wait to add large numbers of members.

Adding members seems to be the slowest operation, so performance work should focus on making it faster. Removing members (even in large numbers) performs well compared to member addition.

~The flamegraph indicates we are spending lots of time on encrypt_welcome, so this could mean that focusing on decreasing the total welcome size could result in a more performant experience for member addition.~

After generating flamegraphs for each benchmark individually, it's clear that there are a few main areas that would benefit from optimization work.

fixes #810

neekolas commented 1 month ago

This is great work @insipx. This has come a long way since my dumb benchmarks. The flamegraphs are super helpful.

While the data would be noisier, I'd be interested in seeing what these numbers look like when we are communicating with the dev network where network latency plays a bigger factor. Those probably offer a better view of what real-world performance looks like.

insipx commented 1 month ago

Definitely can make a report using the dev network!

saulmc commented 1 month ago

> It's difficult to recommend increasing the group size much above 400 if it takes > 3-4 seconds to add all those members -- although this also depends how long users would actually be willing to wait to add large amounts of members.

This seems quite manageable. As a group creator, if I already have a list of 5,000 addresses that I want to create a group from, there's no way I'm entering them by hand. It's no sweat waiting a minute while that group is created in the background by whatever bulk upload tool I'm using. I just expect to be notified when it's done. On the other hand, waiting an hour would be unacceptable.

> There's currently a bug for adding to a populated group. Any benches that add to an already-populated group are not included and annotated with #[allow(dead_code)] until fixed.

I'm very interested to see how this benchmarks once the bug is fixed, as this is the scenario that concerns me more. Adding 1 address to a group of 400 is a by-hand operation that will be expected to complete instantly.

insipx commented 1 month ago

> I'm very interested to see how this benchmarks once the bug is fixed, as this is the scenario that concerns me more. Adding 1 address to a group of 400 is a by-hand operation that will be expected to complete instantly.

That makes sense! That bug is an easier fix too, so it makes sense to see those benchmarks to ensure adding to an already-large group is fast enough. According to Cryspen's early benchmarks (which don't have the network overhead we have here, since it's only the OpenMLS library), adding to a 1000-member group takes ~734ms~ 12ms, so I would expect it to take at least that long, and most likely more.

insipx commented 4 weeks ago

@saulmc

Benchmark adding 1 member to a large group, up to 3200 members (past 3200, it takes too long to populate the group to make the benchmark worthwhile for now).

Adding one member to a group is much faster than creating a large group from scratch, but could still use improvement. After ~4000 members, adding a new member will start to take more than 1 second.

benchmarks for encrypt_welcome.

The average size of the welcome message at ~1600 members is 626,120 bytes, which according to these benchmarks takes ~2ms to encrypt. The network overhead in send_welcomes is more significant, at around ~50ms from some improvised measurement, and there could be improvements elsewhere in the code path too. Making the welcome size constant kills two birds with one stone and should bring this down some, but that's still not the whole story for making single-member addition faster.
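A back-of-envelope check on these numbers, assuming welcome size grows roughly linearly with member count (an approximation consistent with the ratchet-tree intuition, not a measured model):

```rust
// Extrapolate from the measured data point above: 626,120 bytes
// for a welcome at ~1600 members, and the 52428800-byte gRPC
// limit seen in the error logs.
fn main() {
    let welcome_bytes = 626_120.0_f64; // measured at ~1600 members
    let members = 1_600.0_f64;
    let per_member = welcome_bytes / members; // ~391 bytes per member
    assert!((per_member - 391.3).abs() < 0.1);

    // member count at which a single un-chunked welcome would hit
    // the gRPC message-size limit, under the linear-growth assumption
    let grpc_limit = 52_428_800.0_f64;
    let limit_members = grpc_limit / per_member;
    assert!(limit_members > 133_000.0 && limit_members < 135_000.0);
    println!(
        "~{:.0} bytes/member, un-chunked limit at ~{:.0} members",
        per_member, limit_members
    );
}
```

This suggests welcome encryption cost is dominated by payload size, so a constant-size welcome attacks both the ~2ms encrypt time and the ~50ms network overhead at once.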

franziskuskiefer commented 3 weeks ago

> past 3200, it takes way too long to add that many members to a group to make the benchmark worth it for now

For the openmls benchmarks, I generate groups separately and write them out to a file; the benchmarks then only need to read them back in. Even that takes some time, but it allows testing larger groups. Maybe that's something you could do here as well?
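The pattern being suggested (generate expensive fixtures once, persist them, and have the benchmark only pay the cost of loading) can be sketched like this. The file format (one identity string per line) and the function names `write_identities`/`read_identities` are made up for the example, not the openmls or libxmtp benchmark code.

```rust
use std::fs;

// Persist pre-generated fixtures so benchmarks can skip generation.
fn write_identities(path: &str, identities: &[String]) -> std::io::Result<()> {
    fs::write(path, identities.join("\n"))
}

fn read_identities(path: &str) -> std::io::Result<Vec<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(str::to_owned)
        .collect())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("bench_identities.txt");
    let path = path.to_str().unwrap().to_owned();

    // expensive generation happens once, outside the benchmark loop
    let generated: Vec<String> = (0..5).map(|i| format!("identity-{i:04x}")).collect();
    write_identities(&path, &generated)?;

    // the benchmark itself only pays the cost of reading the file
    let loaded = read_identities(&path)?;
    assert_eq!(loaded, generated);
    println!("loaded {} identities", loaded.len());
    Ok(())
}
```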

insipx commented 3 weeks ago

> For the openmls benchmarks I generate groups separately and write them out to a file. Then you only need to read them in the benchmarks. Even that takes some time, but it allows testing larger groups. Maybe that's something you could do here as well?

Yeah, I already have logic to pre-generate the identities and write them to a file, separately for the local and dev networks. I think most of this time is spent between the local libxmtp client and our backend service.

Will update this with updated flamegraphs and dev network benchmarks today.

insipx commented 3 weeks ago

@neekolas

dev benches

Updated the PR/Verdict with benchmarks against dev net plus flamegraphs for the individual benchmarks. I was only able to get the group size up to 450; past that, the benchmark ran into broken-pipe errors and would not complete. I assume this is because it takes a long time and the gRPC server just closes the connection after a while.

So far, all the MLS operations are really fast and barely show up in the benchmarks. Most of the bottlenecks are around gRPC and fetching identities, sending welcomes, etc.