zarusz / SlimMessageBus

Lightweight message bus interface for .NET (pub/sub and request-response) with transport plugins for popular message brokers.
Apache License 2.0

[Host | Host.Outbox | Host.Outbox.Sql] #260 - SQL outbox performance improvements (batching + optimised queries) #261

Closed by EtherZa 3 months ago

EtherZa commented 3 months ago

This comes as (another) unsolicited/undiscussed PR after investigating some performance issues that I was experiencing. Please feel free to be absolutely brutal with it and/or any aspect that is not to your liking.

The most jarring change is probably the addition of IMessageBusBulkProducer and the required changes to MessageBusBase.ProduceToTransport to allow batching and access from OutboxSendingTask (a rough sketch follows the change list below).

Changes:

  1. The SQL outbox is far more performant. Index improvements, limiting soft locks to fewer rows, and single queries/TVPs have had a profound impact. In testing, I saw no performance degradation as the table grew to several million entries.
  2. OutboxSettings.MaintainSequence = false allows load-balanced instances to process the outbox together (the default is true to keep the existing behaviour).
  3. The MessageBusBase.ProduceToTransport signature has been changed to accept multiple messages for the same bus/path. Serialization has also moved into the method for the same reason.
  4. OutboxForwardingPublishInterceptor is ordered with int.MaxValue to ensure that it is the last interceptor to run. Messages published from the outbox do not run through the interceptor pipeline (that already happened before they were placed in the outbox).
  5. Batching support has been added to the Azure Service Bus and Azure Event Hubs providers. Publishing is multiple orders of magnitude quicker than before (not benchmarked). Messages are submitted in batches of up to 100 by default.
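
To make the bulk-produce change a little more concrete, here is a rough sketch of the surface I have in mind; apart from IMessageBusBulkProducer and ProduceToTransport (which the PR actually introduces/changes), the type and member names are placeholders and the real signatures may differ:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only - placeholder names; the actual signatures in the PR may differ.
// A message and its headers destined for a single bus/path.
public record OutboxEnvelope(object Message, IDictionary<string, object> Headers);

// Implemented by transports that can deliver many messages to the same path
// in far fewer round trips (e.g. Azure Service Bus batches of up to 100).
public interface IMessageBusBulkProducer
{
    Task ProduceToTransportBulk(
        IReadOnlyCollection<OutboxEnvelope> envelopes,
        string path,
        CancellationToken cancellationToken);
}
```

With ProduceToTransport accepting multiple messages for one bus/path (and serializing inside the method), OutboxSendingTask can hand a whole page of pending rows to the transport at once instead of looping one message at a time.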

The submission is lacking additional unit tests at the moment (they can be added once a direction is determined), but existing tests should cover most scenarios as the flow should not change. One test was removed as it was no longer relevant.


Again, I completely understand that this is undiscussed. It was thrown together while looking at what could be done to improve the implementation, so please feel free to fob it off as heading in the wrong direction or suggest any changes that you feel would provide something more to your liking. - Richard

zarusz commented 3 months ago

Thanks for looking into this.

Yes, I believe the outbox has deserved a performance improvement pass since its inception.

I wanted to improve the performance of the outbox, especially when going into millions of messages; in another project where SMB is used we saw database contention and some potential for improvement. So this should align with that, I believe.

The batch send was also something on the roadmap, so it's another good avenue.

Let me take the time to review it in depth and align the implementation.

I would also love to add some performance tests to be able to baseline before and after.

zarusz commented 3 months ago

@EtherZa you're definitely on the right track. I've reviewed the PR and provided some suggestions to consider.

As a side note, the bulk produce interface might be leveraged down the road in v3 to send a batch of messages, or when (in future releases?) we bring in auto-batching of single-message publishes.

I would love to see a benchmark test (as an integration test) using the existing Outbox infrastructure on Azure. I can take that on in a second iteration of this feature.

EtherZa commented 3 months ago

I would also love to add some performance tests to be able to baseline before and after.

I hear you with this, but TBH, I don't know how/what to benchmark. The gains come primarily from reducing the number of calls over the network and the cumulative latency between those calls. Mocking out the calls would nullify the benchmark, while leaving them in puts the weighting on latency, which depends on available bandwidth and round-trip time.

EtherZa commented 3 months ago

I would love to see a benchmark test (as an integration test) using the existing Outbox infrastructure on Azure. I can take that on in a second iteration of this feature.

For my own testing, I used the Given_CommandHandlerInTransaction_When_ExceptionThrownDuringHandlingRaisedAtTheEnd_Then_TransactionIsRolledBack_And_NoDataSaved_And_NoEventRaised test, manually dumping several thousand messages into a local containerised DB. The difference is chalk and cheese.

EtherZa commented 3 months ago

As a side note, the bulk produce interface might be leveraged down the road in v3 to send a batch of messages, or when (in future releases?) we bring in auto-batching of single-message publishes.

Great stuff. There are further improvements that could be made with the outbox model too.

Battles need to be picked though :)

EtherZa commented 3 months ago

I have rebased from master and run the first set of benchmarking tests that @zarusz added - thank you for adding them.

The results are skewed, as the outbox is unfairly advantaged by having vastly lower latency and higher bandwidth available to it than the ASB, but they do show an improvement.

Each set of results is from three consecutive runs on a release build using a local SQL instance (containerised) with an Azure Service Bus (Standard tier) in the Australia East region. I have approximately 78 ms of latency to the Australia East data centre (https://www.azurespeed.com/Azure/Latency).

master

direct
Message Publish took    : 00:00:05.1472734  00:00:04.8023864  00:00:04.8906686
Message Consume took    : 00:00:08.2347485  00:00:07.9732388  00:00:07.1775174

with outbox
Message Publish took    : 00:00:01.1531716  00:00:00.7929830  00:00:00.7882242
Outbox Publish took : 00:01:14.1196937  00:01:14.7346352  00:01:15.8530969
Message Consume took    : 00:00:06.9526368  00:00:06.3923101  00:00:06.6098189

feature/260-sql-outbox-batching

direct
Message Publish took    : 00:00:04.8164905  00:00:04.6085346  00:00:05.1897898
Message Consume took    : 00:00:06.8671870  00:00:06.6343886  00:00:07.3392817

with outbox
Message Publish took    : 00:00:01.3155528  00:00:00.7840798  00:00:00.8276036
Outbox Publish took : 00:00:04.0035331  00:00:04.6453286  00:00:04.5108678
Message Consume took    : 00:00:06.6543769  00:00:06.9201661  00:00:06.8933468

There are more changes needed before this will be ready to merge:

  1. Performance of non-bulk transports will be poor at the moment (worse than the original).
  2. Although failed delivery attempts are recorded, the current implementation does not process failed messages any differently from new ones (if a failure is due to size/incompatibility with the service bus, the outbox will eventually stagnate).
  3. The locking mechanism is not optimal. Instead of aborting progress when approaching the end of the lock time, it would be better to extend the lock (see the sketch below).

I am busy implementing the changes for the three items above and will push ASAP.
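
On point 3, roughly what I have in mind is something like the following (a sketch only; the renew/publish delegates stand in for whatever the outbox repository ends up exposing):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal static class OutboxLockRenewalSketch
{
    // Instead of aborting when the lock is about to expire, keep renewing it
    // in the background while the current batch is still being published.
    public static async Task ProcessWithLockRenewalAsync(
        TimeSpan lockDuration,
        Func<CancellationToken, Task> renewLockAsync,      // placeholder for the repository call
        Func<CancellationToken, Task> publishPendingAsync, // placeholder for the sending loop
        CancellationToken ct)
    {
        using var renewCts = CancellationTokenSource.CreateLinkedTokenSource(ct);

        // Renew at half the lock duration so the lock never lapses mid-batch.
        var renewal = Task.Run(async () =>
        {
            while (!renewCts.Token.IsCancellationRequested)
            {
                await Task.Delay(TimeSpan.FromTicks(lockDuration.Ticks / 2), renewCts.Token);
                await renewLockAsync(renewCts.Token);
            }
        }, renewCts.Token);

        try
        {
            await publishPendingAsync(ct);
        }
        finally
        {
            renewCts.Cancel();
            await Task.WhenAny(renewal); // the renewal loop ends in cancellation; don't rethrow it
        }
    }
}
```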

Out of interest, I also ran single batches of 10k and 100k messages for the "with outbox" test only.

10k

Message Publish took    : 00:00:08.2749961
Outbox Publish took : 00:00:18.5741174
Message Consume took    : 00:01:07.0192878

100k

Message Publish took    : 00:01:20.7375655
Outbox Publish took : 00:02:59.2045206
Message Consume took    : 00:11:00.8795735
zarusz commented 3 months ago

@EtherZa It is exciting to see way better numbers after your changes, good job!

A couple of thoughts:

Ping me here once you feel you're done with changes for this iteration and I can review once more.

EtherZa commented 3 months ago

@zarusz I have pushed a commit for the three items above. Please let me know if you are happy with the change, and I will rebase and sign. If you would prefer to defer it to another PR in order to limit scope or to provide an alternate solution, that's fine too - I can remove the commit easily from here.

Latest change includes:

It might be good to capture the numbers (from your local and/or from the github actions) and to make a note on the Outbox docs - so that folks have a feel about the overhead.

I think my timings are probably a little too biased. While there is a substantial performance improvement, mine are definitely stacked against the real world, with the database being local and the service bus being remote. GitHub Actions would be better as there is no favour and the runs are repeatable/independent ... the only problem is that the timings are not being published (no logs for successful tests).
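
One idea (just a thought, not something in this PR) would be for the benchmark tests to append their timings to the Actions job summary, which is rendered even for successful runs:

```csharp
using System;
using System.IO;

internal static class BenchmarkReporting
{
    // GitHub Actions exposes GITHUB_STEP_SUMMARY, a file whose markdown contents
    // are rendered in the run summary, so timings show up without digging through logs.
    public static void ReportTiming(string label, TimeSpan elapsed)
    {
        var summaryPath = Environment.GetEnvironmentVariable("GITHUB_STEP_SUMMARY");
        if (string.IsNullOrEmpty(summaryPath))
        {
            Console.WriteLine($"{label}: {elapsed}"); // local run - console output is enough
            return;
        }

        File.AppendAllText(summaryPath, $"- {label}: `{elapsed}`{Environment.NewLine}");
    }
}
```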

I might consider refactoring and segmenting the bulk vs non-bulk capable transports (low on my priority list). In the future (perhaps v3), elevating the core interfaces to be able to produce a batch of messages would go hand in hand with this PR.

Supporting batch uploads to SVB/SQL would be a nice-to-have and would definitely be useful, but probably not used too often.

In terms of performance, the impact from the boxing looks negligible IMO. It is too closely aligned with IO to make any tangible difference.

Before - Message Publish took   : 00:00:05.1472734  00:00:04.8023864  00:00:04.8906686
After  - Message Publish took   : 00:00:04.8164905  00:00:04.6085346  00:00:05.1897898

Looking at the consumer times for 10k vs 100k for ASB, it would require adding partitioning and more concurrent consumers to let it scale better.

As long as a single service can publish more messages from the outbox than that service can consume, I can't see an issue. It could become a problem if order is important though; it would kill scaling out in its current form. In that case, it would be better to lock by bus/path or a custom strategy, so that multiple instances could each deliver different streams.
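
Rough shape of what I mean by a custom strategy (a placeholder sketch only; nothing like this exists in the PR yet):

```csharp
// Placeholder sketch: a pluggable strategy deciding what an instance locks on,
// so ordered delivery is preserved per stream while different instances drain
// different streams concurrently.
public interface IOutboxLockKeyStrategy
{
    // e.g. return $"{busName}:{path}" to lock per bus/path,
    // or a constant value to fall back to a single global lock.
    string GetLockKey(string busName, string path);
}
```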

Please let me know your thoughts and I'll take it from there.

EtherZa commented 3 months ago

@zarusz Comments resolved, test added and rebased

sonarcloud[bot] commented 3 months ago

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarCloud

zarusz commented 3 months ago

Thanks for this massive PR @EtherZa!

EtherZa commented 3 months ago

Thanks for this massive PR @EtherZa!

@zarusz It's a pleasure. Thanks for the help and library :)