zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/

Allow disabling of xid/commit fields to avoid disk buffering #191

Open danfiza opened 8 years ago

danfiza commented 8 years ago

I have seen an issue where, if I update a large number of records (~400,000+ rows) with a single update query, a memory limit is reached and Maxwell spills over to disk, sometimes never finishing or recovering.

Is there a way to avoid such an issue? Or a way around it? Increasing some kind of memory limit?

This also happens when an ALTER statement (e.g. adding a new column with a default value) runs on a relatively large table, ~1,000,000 rows.

osheroff commented 8 years ago

Yeah, the buffering to disk thing is something I'm fighting with right now. It fell out of an initial poor design in https://github.com/zendesk/maxwell/pull/80, which I'm considering tearing up. The basic problem is that if you want to include mysql's transaction ID in the row output, it's at the very end of the transaction, so you can't know it until you buffer all the rows in the transaction.

I think I should probably just make that an opt-in feature; most people probably won't care too much about the transaction id and the buffering is annoying.
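To make the buffering problem above concrete, here is a minimal sketch of why stamping every row with the transaction id forces the whole transaction to be held. The class and method names (`XidBufferingSketch`, `Event`, `emitWithXid`) and the string output format are hypothetical illustrations, not Maxwell's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Maxwell itself.
final class XidBufferingSketch {
    /** Simplified binlog event: either a row change or the closing XID event. */
    static final class Event {
        final boolean isXid;
        final long xid;
        final String row;

        private Event(boolean isXid, long xid, String row) {
            this.isXid = isXid;
            this.xid = xid;
            this.row = row;
        }

        static Event row(String row) { return new Event(false, 0L, row); }
        static Event xid(long xid) { return new Event(true, xid, null); }
    }

    /** Emits one output line per row, each stamped with its transaction's xid. */
    static List<String> emitWithXid(List<Event> binlog) {
        List<String> out = new ArrayList<>();
        // The xid only arrives with the last event of the transaction,
        // so every row seen before it has to wait here.
        List<String> buffer = new ArrayList<>();
        for (Event e : binlog) {
            if (!e.isXid) {
                buffer.add(e.row); // grows with the size of the transaction
            } else {
                for (String row : buffer) {
                    out.add(row + " xid=" + e.xid);
                }
                buffer.clear();
            }
        }
        return out;
    }
}
```

In this sketch the buffer holds every row of the transaction until the XID event shows up, which is where a 400,000-row update would start to hurt.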

danfiza commented 8 years ago

Ah, I see, that makes sense. We are currently thinking about using Maxwell in a production environment as well. It would be nice to avoid this issue if possible, but it isn't a blocker, since it can be avoided with LIMITs on updates and only affects a limited use case.

m-denton commented 6 years ago

@osheroff Was this ever revisited? We are experiencing this now as we have a cronjob running at night, axing around 300,000 rows. Once this happens, it loads up our local redis instance and then causes CPU burn.

Would seem beneficial to have conditional blacklisting possibilities for these scenarios.

mattcollins107 commented 6 years ago

In our use case, it would be sufficient to know the bounds of the transaction, as opposed to needing the xid and commit flag more specifically (these two appear tied together, the commit flag isn't included unless xid tracking is enabled). One possible idea would be to include BEGIN and COMMIT entries in the output, although arguably only COMMIT is generally needed, wherein those new events would be informational as opposed to strictly database changes.

Maxwell already identifies the beginning of a transaction, at which point it starts queueing events from the binlog into a buffer, and only upon recognizing the commit does it move the buffer through the rest of the process. The flag to skip tracking xids doesn't currently appear to have any impact on the buffering strategy, but if such a flag were set, Maxwell could replace the buffering with a simpler single-event streaming approach, in which at least the COMMIT is included as an event in the output to signal the end of a transaction (and implicitly the beginning of the next). This keeps the in-memory footprint low while still providing some support to processes that look for transactional frames.

That said, looking at the code, it's not clear to me how many rows in a single transaction would be too many. If that limit could be pushed to a pretty large number, it would matter less whether the transactional events were buffered or streamed individually. Perhaps someone else has more insight?

osheroff commented 6 years ago

@m-denton, not sure what you mean by "causes CPU burn" -- on redis? On Maxwell? This won't help all that much with cpu usage, it's really more about memory/temp-disk space.

osheroff commented 6 years ago

@mattcollins107 I think the solution should be to have commit: true appear regardless of the state of the xid flag. That'd be enough for tx-boundary detection.

mattcollins107 commented 6 years ago

@osheroff Ha, you're right. That would work well and be a lot simpler of course, not sure why I was trying to make it so complicated. So unless xid was needed on all events in the transaction, we would really only need to buffer the last event to attach the commit flag. Having only one event ever in memory would be a nice win to avoid the memory pressures and maintain high speed.
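The one-event lookahead described above could be sketched roughly as follows. This is only an illustration under assumptions: `CommitLookaheadSketch`, `onRow`, and `onCommit` are hypothetical names, rows are plain strings rather than Maxwell's real JSON output, and the ` commit=true` suffix stands in for the commit flag.

```java
import java.util.function.Consumer;

// Hypothetical sketch; not Maxwell's actual producer code.
final class CommitLookaheadSketch {
    private final Consumer<String> sink;
    private String pending; // at most one row is ever held in memory

    CommitLookaheadSketch(Consumer<String> sink) {
        this.sink = sink;
    }

    /** A new row arrived: the previously held row was not the last one, so emit it as-is. */
    void onRow(String row) {
        flush(false);
        pending = row;
    }

    /** The transaction ended: the held row is the last one, so mark it as the commit. */
    void onCommit() {
        flush(true);
    }

    private void flush(boolean commit) {
        if (pending == null) {
            return; // nothing held, e.g. an empty transaction
        }
        sink.accept(commit ? pending + " commit=true" : pending);
        pending = null;
    }
}
```

Usage would look something like `s.onRow("r1"); s.onRow("r2"); s.onCommit();`, which emits `r1` unmarked and `r2` with the commit marker. Memory stays constant regardless of transaction size, at the cost of not having the xid on each row.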