zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/

Does maxwell support transaction event grouping? #504

Open ghost opened 7 years ago

ghost commented 7 years ago

Hi guys,

I am currently evaluating maxwell for my project needs. My understanding is that maxwell will generate an event whenever a DDL/DML statement is encountered in the binary log. With this in mind, I wonder whether it is possible to enable transaction event grouping in maxwell? That is, queue up all events that arrive after a BEGIN query is encountered, wait for a COMMIT query to arrive, then flush all queued events and save the corresponding binary log position to mark the entire transaction as processed.

osheroff commented 7 years ago

hi @dmytrokoval, yes and no.

Yes: Maxwell only ever saves a binlog position when it completes a transaction. MySQL itself doesn't write data into the binlog until the transaction is committed. Maxwell includes the xid field in its output so that you can correlate rows back to the single transaction they belong to. So the output looks like:

BEGIN
insert into test.foo set id = 1;
insert into test.foo set id = 2;
COMMIT
{"database":"test","table":"foo","type":"insert","xid":69283,"thread_id":126,"data":{"id":1}}
{"database":"test","table":"foo","type":"insert","xid":69283,"commit":true,"thread_id":126,"data":{"id":2}}

But no: Maxwell outputs 1 message per row, and there's currently no feature to batch up the rows in a transaction into an array or single message.
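To make the xid/commit correlation concrete, here is a minimal sketch of how a downstream consumer could detect transaction boundaries from those two fields alone. This is illustrative only, not part of Maxwell; it assumes Jackson for JSON parsing and feeds in the two example messages above by hand instead of consuming them from Kafka:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XidGroupingSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String[] messages = {
            "{\"database\":\"test\",\"table\":\"foo\",\"type\":\"insert\",\"xid\":69283,\"thread_id\":126,\"data\":{\"id\":1}}",
            "{\"database\":\"test\",\"table\":\"foo\",\"type\":\"insert\",\"xid\":69283,\"commit\":true,\"thread_id\":126,\"data\":{\"id\":2}}"
        };

        // Buffer rows per xid; the row carrying "commit":true is the last row of
        // its transaction, so the group is complete when that row shows up.
        Map<Long, List<JsonNode>> open = new HashMap<>();
        for (String message : messages) {
            JsonNode row = mapper.readTree(message);
            long xid = row.get("xid").asLong();
            open.computeIfAbsent(xid, k -> new ArrayList<>()).add(row);
            if (row.path("commit").asBoolean(false)) {
                List<JsonNode> tx = open.remove(xid);
                System.out.println("xid " + xid + " committed with " + tx.size() + " rows");
            }
        }
    }
}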

ghost commented 7 years ago

Would you consider adding transaction event grouping in future releases? In my view, it is going to be a highly demanded feature among maxwell's users. It would be great to have some sort of callback that allows the caller to group events belonging to the same transaction however it wants, plus a guarantee that the saved binlog position in the positions table points to the beginning of the latest observed transaction. It would dramatically simplify recovery scenarios when transaction event grouping is enabled.

nzwalker commented 7 years ago

We have run into similar issues; having the transactions grouped makes it easier to see which events occur together. Instead of writing it into Maxwell, we created an intermediate Kafka consumer which batched the events, combined them into a single TX event, and sent them to another Kafka topic.

This has a major flaw which is not immediately obvious: unbounded event counts for a single transaction. Since the number of events in a single TX (same XID) is unbounded, your resulting message size is also unbounded. This is especially problematic for Kafka, since you need to enforce sane message size limits to keep replication and garbage collection performant.

You can obviously get around this by enforcing some sane limits, splitting up transactions, etc., but we've found it much easier to do the combination logic in downstream consumers instead.
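For what it's worth, the shape of such an intermediate consumer can stay fairly small. A rough sketch of the buffering-with-a-cap idea described above, where the topic name maxwell-transactions, the 500-row cap, and the array-of-rows output format are all assumptions, and error handling is omitted:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TxRollup {
    private static final int MAX_ROWS_PER_MESSAGE = 500;   // hypothetical per-message cap
    private final ObjectMapper mapper = new ObjectMapper();
    private final Map<Long, List<JsonNode>> open = new HashMap<>();
    private final KafkaProducer<String, String> producer;

    TxRollup(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Called for each row-level message consumed from the raw maxwell topic.
    void onRow(String json) throws Exception {
        JsonNode row = mapper.readTree(json);
        long xid = row.get("xid").asLong();
        List<JsonNode> rows = open.computeIfAbsent(xid, k -> new ArrayList<>());
        rows.add(row);

        boolean commit = row.path("commit").asBoolean(false);
        // Flush early if the transaction is too large (splitting it),
        // or when the commit row arrives (completing it).
        if (rows.size() >= MAX_ROWS_PER_MESSAGE || commit) {
            emit(xid, rows);
            if (commit) {
                open.remove(xid);
            } else {
                rows.clear();
            }
        }
    }

    private void emit(long xid, List<JsonNode> rows) throws Exception {
        ArrayNode batch = mapper.createArrayNode();
        rows.forEach(batch::add);
        String value = mapper.writeValueAsString(batch);
        // One message per (possibly partial) transaction, keyed by xid.
        producer.send(new ProducerRecord<>("maxwell-transactions", Long.toString(xid), value));
    }
}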

osheroff commented 7 years ago

@dmytrokoval yeah what @nzwalker said :)

We do somewhat of the same thing at zendesk, deploying a little project that rolls up and filters events into a secondary stream. It seems ok on a per-use-case basis, as each transaction rollup can sort of define its own parameters: if it spills over the size limit, drop it or split it. But trying to define what all that means inside maxwell itself seems like a nightmare, so I avoid it.

@henders what say we clean up and open-source 'maxwell-filters'?

Do note that maxwell is smart about the binlog position; it only ever writes binlog positions at the end of transactions. See https://github.com/zendesk/maxwell/blob/a29e62a804704fd9cfee5c78ddc3fc8d7e270d73/src/main/java/com/zendesk/maxwell/producer/MaxwellKafkaProducer.java#L60 for example; we only advance the binlog pointer on the final row of the transaction.
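For anyone who wants the gist without reading the linked source: the idea is simply that the producer's completion callback persists a binlog position only for the row that ends its transaction, so a crash mid-transaction replays the whole transaction rather than half of it. A rough, illustrative sketch (not Maxwell's actual classes; PositionStore and the string position format are made up here):

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

class PositionStoringCallback implements Callback {
    private final boolean isFinalRowOfTransaction;
    private final String binlogPosition;   // e.g. "mysql-bin.000123:4567", format assumed
    private final PositionStore store;     // hypothetical persistence interface

    PositionStoringCallback(boolean isFinalRowOfTransaction, String binlogPosition, PositionStore store) {
        this.isFinalRowOfTransaction = isFinalRowOfTransaction;
        this.binlogPosition = binlogPosition;
        this.store = store;
    }

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            return; // real code would log and/or retry; omitted here
        }
        // Only the commit row advances the stored position.
        if (isFinalRowOfTransaction) {
            store.save(binlogPosition);
        }
    }

    interface PositionStore {
        void save(String position);
    }
}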

ghost commented 7 years ago

That sounds interesting, indeed. I would like to wrap my head around the concept of maxwell-filters.

henders commented 7 years ago

@osheroff: open-sourcing maxwell-filters would take a bit of work to open-source some of its dependencies as well, but it's doable at the moment. It would be worth deciding soon whether it's useful to open-source, as we were thinking of adding more Zendesk-specific logic to filter extra things like data migrations.

henders commented 7 years ago

If it makes a difference for usage and deployment, @dmytrokoval: it's a small Scala app built on Akka-Streams-Kafka. We also ran into the problem of transactions exceeding the Kafka max message size, and we're currently dropping those after logging the incident.

pjebs commented 4 years ago

@osheroff Why is Maxwell grouping row events per transaction using xid instead of gtid?

Shouldn't Maxwell expose gtid instead?

pjebs commented 4 years ago

@propersam