zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/
Other
4k stars 1.01k forks source link

Commit Info missing on updates #1476

Open agillz opened 4 years ago

agillz commented 4 years ago

Hello Team Maxwell,

We are seeing a strange issue and currently unable to determine the root cause. We have maxwell running against a mysql aurora instance which has multiple databases. In the config, we have set output_commit_info=true

We are seeing some events are missing xid while others have it. Same table with same structure exists in multiple databases, sorry I had to scrub the database name as it is contains our client info. 1 copy 2 copy

osheroff commented 4 years ago

could they be myISAM tables in one database?

The other possibility is that you are, for some or other reason, receiving a COMMIT query event instead of or in addition to the XID event. Can you do:

show binlog events in 'binlog-file' from xxx limit 100; 

On the server without the XID, where xxx is the binlog position of the event.

agillz commented 4 years ago

Tables are InnoDB version 10 in all databases, it's the same server for events with xid and without xid. Here is what the event looks like in binlog and also attached output from maxwell.

4 3
osheroff commented 4 years ago

the only difference I notice between the two commits is that the XID overflows a 32-bit integer in the case where it's missing. I can't repro anything wrong there in the most synthetic of tests. before I go chasing further down this rabbit hole, can you verify that the raw JSON in your kafka/whatever stream is missing the xid? your error messages appear to be coming from a stream processor and I want to rule that out of the equation.

agillz commented 4 years ago

Hi Ben, confirmed that raw message on the topic also does not have xid. This output is using kafka-console-consumer

Screen Shot 2020-05-04 at 10 44 36 AM
osheroff commented 4 years ago

I'm utterly baffled. All of this must be happening here:

https://github.com/zendesk/maxwell/blob/564a32f24b71b46ef3d84b271557a68dc757ce3f/src/main/java/com/zendesk/maxwell/replication/BinlogConnectorReplicator.java#L459-L471

your row has commit: true, which means that line 462 is being fired. But xid must be null somehow, and the only way that's possible is if the call to setXid on line 468 isn't happening. event.xidData().getXid() returns a long, not a Long, which means that the underlying library can't be at fault; we'd see xid of 0 not null if that was the case.

So this means that eventType is not EventType.XID, but that means that we should be seeing a "COMMIT" event that's actually a EventType.QUERY`. Which we don't.

So. Back to the drawing board.

The only theory I've got, and it's admittedly really weak, is that somehow Aurora is pumping SQL-query-style commit into the replication stream that you don't see with SHOW BINLOG EVENTS.

Other weak theories:

Do you have other cases of the bug? Any commonalities?

agillz commented 4 years ago

Thanks for pointing out the code where xid is originating from. I will add an assertion there to log the raw event that has the issue. Hoping it will provide some insight. Would it helpful? Let me know if you want some more debugging to be added in getTransactionRows or elsewhere. Post this test ,if it seems be related to Aurora,I will get in touch with their technical support.

This is our first use case that relies on commit info thus we can not tell for sure how widespread the issue could be. We are reading from three different servers (5.6.mysql_aurora.1.22.2) using maxwell v1.24.1

osheroff commented 4 years ago

I will add an assertion there to log the raw event that has the issue. Hoping it will provide some insight. Would it helpful?

Yeah that'd be quite helpful. If we can catch this in action we'll be able to see why it happened... there's definitely not been any changes since 1.24.1 that would have fixed this, so let's stay there.