zarusz / SlimMessageBus

Lightweight message bus interface for .NET (pub/sub and request-response) with transport plugins for popular message brokers.
Apache License 2.0
467 stars 78 forks source link

[Host.Outbox | Host.Outbox.Sql] Poor performance when publishing outbox messages #260

Closed EtherZa closed 3 months ago

EtherZa commented 3 months ago

The performance of the outbox plug-in can be greatly improved; especially when publishing messages from the outbox/db.

A few of the issues:

SQL:

  1. PK of outbox table is using a clustered index with a unique identifier
  2. Locking of the table updates all undelivered rows instead of just those included in the batch
  3. Updating of delivery status is by multiple statements instead of a table-valued parameter (TVPs)
  4. Only a single application instance can spool messages (always FIFO, but not always required).

Host:

  1. Interceptors are executed on initial publish (before outbox) and again when being submitted to message bus (after outbox)
  2. Messages are trickle fed from the outbox to the message bus even when providers support batches

Proposal:

  1. Improve SQL schema, indexes and queries
  2. Apply an order to the outbox interceptor so that it is always the last interceptor to run (and skip running interceptors when publishing from outbox)
  3. Modify MessageBusBase.ProduceToTransport to accept batches of messages for the same bus and path
  4. Modify OutboxSendingTask to group messages for the same bus and path together and submit to MessageBusBase.ProduceToTransport directly without having to traverse through the pipeline stack again.
  5. Submit batches of messages to buses that support it while continuing to trickle feed to those that don't
zarusz commented 3 months ago

I am prepping a quick benchmark test to be able to assess the gains of this PR. Once I have something, I will report here.

EtherZa commented 3 months ago

I'm busy refactoring the OutboxSenfingTask to drop support for non-IMessageBulkProducer providers as you suggested.

As part of it, I'm also changing the lock to renew on a timer instead of killing the batch process. This should provide a little more safety when handling issues like ASB throttling.

I have also uncovered a potential critical issue which I'm putting a fix in for -the DeliveryAttempt value is set, but never evaluated. If a message ended up being too big for the queue to accept (for example), the current implementation will just keep trying indefinitely. Apart from the obvious resourcing problem, it would eventually lead to all available 'slots' in the batch being taken by such failures and the outbox stalling. I am just going to limit the attempts without delays, etc for now as it is an edge case. More can come later.

zarusz commented 3 months ago

@EtherZa the baseline test + 2 more lighter enhancements is in #263. Feel free to comment. Ideally, I would merge this to master, and if you could rebase your changes after.

Considering if it's worth including another transport that will not support bulk produce to get a feel if the gains are due to SQL improvements (lock acquiring logic, indexes) or due to bulk send to the underlying transport.

EtherZa commented 3 months ago

@zarusz I have a cursory glance at #263.

There are two other aspects that you may want to include for pure outbox testing:

  1. Consider dumping a large batch of messages into the outbox table outside of the pipeline (seed the table) to isolate the sql/batching from the natural population. The outbox read will far out perform single insertions.
  2. Batching is unfairly favoured when there is only one bus/path combination (every iteration will be piped as a batch if the transport supports it) as opposed to multiple batches per bus/path/iteration.
  3. There are two batching limits in play. The OutboxSettings.PollBatchSize and MaxMessagesPerTransaction. The later is a internal maximum batch size for each transport. ASB and AEH have the value set at "100" (max for TransactionScope support), while the other providers are all set to "1" by default (no batching). Performance of the "1" providers will be degraded compared to the original implementation as it currently will make a single sql update per sent message. This needs to change.

I am still trying to work out how/if TransactionScope can be used to wrap the bus/sql publish/update without escalating to a DTC.

zarusz commented 3 months ago

It was fixed via #261.