zalando-nakadi / nakadi-producer-spring-boot-starter

Nakadi event producer as a Spring boot starter

Support higher-volume cases (and potentially an ordering guarantee) by using pgq #139

Open ePaul opened 4 years ago

ePaul commented 4 years ago

Background

Within Zalando we are currently discussing how to implement reliable (transactional) event sending, which is basically what this library tries to do.

When I mentioned this library (and that we are using similar approaches in another team, where we do a nightly full vacuum), it was pointed out (by @CyberDem0n):

That's actually the major problem of such homegrown solutions.

  1. Write amplification (you are not only inserting into the queue table, but also updating/deleting).
  2. Permanent table and index bloat due to 1.
  3. Regular heavy maintenance required due to 2.
  4. Maintenance always affects normal processes interacting with the events table.
  5. If the event flow is relatively high, it quickly becomes no longer enough to do vacuum full/reindex only once a night.

In this regard pgq is maintenance free. For every queue you create, it creates a few tables under the hood. These tables are INSERT ONLY, therefore they are explicitly excluded from autovacuum. The tables are used in a round-robin manner. Since events are always processed in strict order, it is enough to keep a pointer to the latest row (event) that was processed, so no UPDATEs/DELETEs are required on the event table. Once all events from a specific table are processed, PgQ simply does a TRUNCATE on that table. These tricks make PgQ very scalable. Ten years ago, when PostgreSQL didn't yet have built-in streaming replication, PgQ was used as the basis for the logical replication solution Londiste. Both were developed by Marko Kreen while working for Skype. IIRC, 3 or 4 years ago Skype was still relying on PgQ and Londiste, because they just work.

@a1exsh pointed me to the pgq SQL API and promised to help with code review if we want to integrate this into this library.

Goal

Find a way to use a pgq queue instead of the current event_log table for storing events for later Nakadi submission. This should be optional, as not every user of this library has pgq available or the ability to install PostgreSQL extensions.

ePaul commented 4 years ago

Implementation ideas/concerns

Abstracting queue access

This can possibly be done mostly by providing a different implementation of the EventLogRepository interface – possibly with some adaptations (including in the code using it):

PGQ API documentation
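As a rough illustration (not the library's actual code), a pgq-backed implementation could hide `pgq.insert_event` behind the repository abstraction. The class and method names below are made up for this sketch; only the `pgq.insert_event` call is real pgq API.

```java
// Sketch of a pgq-backed event log repository; the real EventLogRepository
// interface of this library may look different (names here are assumptions).
import org.springframework.jdbc.core.JdbcTemplate;

public class PgqEventLogRepository {

    private final JdbcTemplate jdbcTemplate;
    private final String queueName;

    public PgqEventLogRepository(JdbcTemplate jdbcTemplate, String queueName) {
        this.jdbcTemplate = jdbcTemplate;
        this.queueName = queueName;
    }

    /**
     * Stores an event in the pgq queue instead of the event_log table.
     * pgq.insert_event is a plain INSERT under the hood, so it participates in
     * the surrounding transaction: the event only becomes visible to consumers
     * if the business transaction commits.
     */
    public void persist(String eventType, String eventJson) {
        jdbcTemplate.queryForObject(
                "SELECT pgq.insert_event(?, ?, ?)",
                Long.class,
                queueName, eventType, eventJson);
    }
}
```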

Strict ordering vs. retrying

We might need some separate functionality for events whose submission failed.

I'm not sure how this can be abstracted in a useful way.
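One option pgq itself offers is its retry queue: a failed event can be pushed back with `pgq.event_retry`, so the rest of the batch can still be finished, at the cost of the retried event being re-delivered later and therefore out of its original order. A minimal sketch, assuming plain JDBC access (the class and method names are hypothetical):

```java
// Hedged sketch: handling a failed submission via pgq's retry queue.
import org.springframework.jdbc.core.JdbcTemplate;

public class PgqRetryHandler {

    private final JdbcTemplate jdbc;

    public PgqRetryHandler(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    /**
     * Puts a single failed event back onto the retry queue so that the current
     * batch can still be finished. Note: the retried event will be delivered
     * again later, i.e. out of its original order.
     */
    public void retryLater(long batchId, long eventId, int retryAfterSeconds) {
        jdbc.queryForObject("SELECT pgq.event_retry(?, ?, ?)",
                Integer.class, batchId, eventId, retryAfterSeconds);
    }
}
```

If strict ordering matters more than throughput, the alternative would be not to finish the batch at all and re-process it as a whole – an unfinished batch is simply handed out again on the next poll.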

Consumption
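The consuming side would roughly follow pgq's standard batch loop: register a consumer once, then repeatedly fetch a batch, submit its events to Nakadi, and finish the batch. A sketch, with the queue/consumer names and the actual Nakadi submission step as placeholders:

```java
// Sketch of the consuming side, assuming plain JDBC access via Spring's
// JdbcTemplate; queue and consumer names are assumptions for illustration.
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;

public class PgqConsumerLoop {

    private final JdbcTemplate jdbc;
    private final String queueName = "nakadi_events";      // assumed queue name
    private final String consumerName = "nakadi_producer"; // assumed consumer name

    public PgqConsumerLoop(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
        // Registering returns 0 if this consumer is already registered.
        jdbc.queryForObject("SELECT pgq.register_consumer(?, ?)",
                Integer.class, queueName, consumerName);
    }

    /** Fetches one batch, submits it to Nakadi, and closes the batch. */
    public void pollOnce() {
        Long batchId = jdbc.queryForObject("SELECT pgq.next_batch(?, ?)",
                Long.class, queueName, consumerName);
        if (batchId == null) {
            return; // no new events available yet
        }
        List<Map<String, Object>> events = jdbc.queryForList(
                "SELECT ev_id, ev_type, ev_data FROM pgq.get_batch_events(?)",
                batchId);
        // ... submit `events` to Nakadi here (grouped per event type) ...
        jdbc.queryForObject("SELECT pgq.finish_batch(?)", Integer.class, batchId);
    }
}
```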

Configuration
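Since the goal says the pgq backend should be optional, the configuration could be as small as an opt-in flag plus a queue name. The property prefix and class below are purely hypothetical and not part of the current library:

```java
// Hypothetical opt-in configuration for a pgq backend.
import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "nakadi-producer.pgq")
public class PgqProperties {

    /** Whether to store events in a pgq queue instead of the event_log table. */
    private boolean enabled = false;

    /** Name of the pgq queue to use. */
    private String queueName = "nakadi_events";

    public boolean isEnabled() { return enabled; }
    public void setEnabled(boolean enabled) { this.enabled = enabled; }
    public String getQueueName() { return queueName; }
    public void setQueueName(String queueName) { this.queueName = queueName; }
}
```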

a1exsh commented 4 years ago

I think it makes sense to consider the task of consuming from a PGQ queue and publishing batches of events to Nakadi separately from the task of producing events for the queue in the first place. This way the former component may be reused more widely, e.g. when the application producing the events isn't using Spring (or Java, for that matter).

This library can still provide building blocks for both components, but they could be designed with a potential separate deployment in mind.

ePaul commented 4 years ago

@a1exsh This is certainly something to consider. I just worry that having separate deployments will increase the configuration overhead (i.e. you can't just plug this library into your application and have everything work), but maybe it's the way to go. Maybe we can find a way to have both the easy setup and an option for separating things where it's needed.

If other libraries (or applications implementing this manually) want to interoperate, we also need to make sure to specify the format of the events in the queue, which limits our ability to evolve it in the future.