zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/

Unhandled QueryEvent inside transaction #1129

Closed jkrcma closed 5 years ago

jkrcma commented 6 years ago

We run a fairly large Percona 5.6 setup: GTID enabled, a 4 TB database, over 1k transactions per second. Maxwell is connected to a slave node and replicates to a RabbitMQ cluster. I repeatedly observe the following behavior:

06:53:48,767 WARN  BinlogConnectorReplicator - Unhandled QueryEvent inside transaction: QueryEventData{threadId=1593427976, executionTime=17, errorCode=0, database='', sql='BEGIN'}

Once this is logged, the whole binlog currently being replicated is ignored until the next binlog file is opened on the replication node. This creates a large GTID gap of tens of thousands of transactions, and Maxwell doesn't try to catch up at all. Once a critical number of gaps is present in the GTID set, we can no longer get rid of them because the replication node has already purged its binlogs.

I tried to dig deeper into the binlogs to find out what exactly causes the unhandled QueryEvent, but I haven't found anything particularly useful.

osheroff commented 6 years ago

is there any logging information before the unhandled QueryEvent?

osheroff commented 6 years ago

I'm wondering if there's a way we could locate where in the binlog Maxwell is failing; the unhandled query event seems to indicate that Maxwell somehow missed the COMMIT of the previous transaction, but I'm really not sure how. Getting a SHOW BINLOG EVENTS around the relevant section of the binlog could help.
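If it's easier to pull programmatically, something along these lines would grab the events around a given offset over JDBC (host, credentials, file name, and position are all placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShowBinlogEvents {
        public static void main(String[] args) throws Exception {
            // Placeholders: point this at the replica Maxwell reads from,
            // and at the binlog file / position near the suspect BEGIN.
            String url = "jdbc:mysql://replica-host:3306/";
            String binlogFile = "mysql-bin.000123";
            long fromPos = 4;

            try (Connection conn = DriverManager.getConnection(url, "maxwell", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SHOW BINLOG EVENTS IN '" + binlogFile + "' FROM " + fromPos + " LIMIT 50")) {
                while (rs.next()) {
                    // columns returned by SHOW BINLOG EVENTS
                    System.out.printf("%s %d %s server=%d end=%d %s%n",
                            rs.getString("Log_name"), rs.getLong("Pos"), rs.getString("Event_type"),
                            rs.getLong("Server_id"), rs.getLong("End_log_pos"), rs.getString("Info"));
                }
            }
        }
    }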

jkrcma commented 6 years ago

There is no information before the event.

I located the particular transaction using the mysqlbinlog tool but didn't see anything eye-catching besides the BEGIN statement itself, and I didn't find any similarity in the transaction contents either. Should I look backwards and check the previous statements?

osheroff commented 6 years ago

yeah, I'm more interested in the statement directly before the BEGIN that triggered the unhandled query event.

jkrcma commented 6 years ago

Assuming the transaction before is the one at the edge of the GTID interval, then in today's case we are talking about a huge transaction. There was definitely a big batch of large transactions going on in the few minutes before the unhandled event happened. At the same time a binary backup was running on that node, causing a bit of extra replication delay.

I've isolated the transaction into a text file, and it looks complete, including the COMMIT statement. Rough stats: 114 inserts, 284 updates, 497 deletes.

The transaction just before the GTID edge is a small single-statement UPDATE query.

osheroff commented 6 years ago

I’m really shooting in the dark here, but I’m wondering if the issue with missing the COMMIT to that transaction is somehow correlated with the other GTID issue you’re seeing where it appears to ignore binlog events. Are the GTID numbers in the problematic section of the binlog strictly monotonic?

Also, just to confirm, is this what you’re seeing?

Transaction X arrives safe and sound at kinesis.
Transaction X+1 is never sent.
Transaction X+2 triggers the "unhandled query event" message.

Correct? Pinging @shyko to see if he has any GTID insight here


jkrcma commented 6 years ago

You mean the replicator could get stuck expecting a COMMIT at a certain position that never arrives? The other issue, though, is observable when I restart Maxwell to re-replicate the missing transactions of the gap (which is btw a bad idea because replication isn't deterministic?).

The issue hasn't happened for a week and the binlogs from that time have been erased, so I can't do any research now, just my luck :) I am not sure how to recognize whether transaction X+1 was sent or not, but if our binlogs were not monotonic we would have much bigger problems, I believe.

Any ideas what to try? I am able to put one of our real slaves into maintenance mode and play with it arbitrarily if needed.

osheroff commented 6 years ago

Something to try would be to see if it's a problem that's reliably reproduced with the binlog in question. You could use a second instance of maxwell with a combination of --replay and --init_position to try to reproduce the issue.


jkrcma commented 5 years ago

Sorry for the longer delay, I had other priorities at work. I found something which could be related to this issue. Do you know how well Maxwell or the binlog connector handles switching binlog_format to STATEMENT and then back to ROW inside a session?

I was refactoring our own implementation of slave lag metrics, which was created before pt-heartbeat was introduced, and I also removed an unused table which was supposed to store the MySQL thread loads across the cluster. There is an event running every second, and inside this event there was a cleanup query to ensure the loads table doesn't grow too big. It looked like this:

    ....
    SET SESSION binlog_format = 'STATEMENT';
    DELETE FROM maintenance.mysqld_load WHERE `timestamp` < (UTC_TIMESTAMP() - INTERVAL 30 SECOND);
    SET SESSION binlog_format = 'ROW';

Since I altered the event, the number of GTID gap issues has dropped to nearly zero. Considering that our services are facing increased traffic due to the shopping season, the chance of this being a random coincidence gets lower every day.

However, there are still a few gaps appearing. I found out that pt-table-checksum also uses this trick. We run the checksum once a week.

osheroff commented 5 years ago

I tested it out locally and Maxwell didn't seem to have a problem; it ignores the STATEMENT-based events, as expected, and then switches back normally. I'm not discounting this as a cause, since it seems related, just noting that I couldn't repro it locally.
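For reference, the kind of local test I mean is just watching the raw event stream while a session flips formats. A minimal sketch using shyiko's connector directly (host and credentials are placeholders):

    import com.github.shyiko.mysql.binlog.BinaryLogClient;
    import com.github.shyiko.mysql.binlog.event.EventType;
    import com.github.shyiko.mysql.binlog.event.QueryEventData;

    public class FormatFlipWatcher {
        public static void main(String[] args) throws Exception {
            // Placeholders: a local MySQL with binlog_format=ROW as the default.
            BinaryLogClient client = new BinaryLogClient("127.0.0.1", 3306, "maxwell", "secret");
            client.registerEventListener(event -> {
                EventType type = event.getHeader().getEventType();
                Object data = event.getData();
                if (data instanceof QueryEventData) {
                    // statement-logged DML (e.g. the DELETE above) arrives as a QUERY event;
                    // row-format traffic shows up as TABLE_MAP / WRITE_ROWS / UPDATE_ROWS events
                    System.out.println(type + ": " + ((QueryEventData) data).getSql());
                } else {
                    System.out.println(type);
                }
            });
            client.connect(); // blocks; run the SET SESSION binlog_format sequence from another session
        }
    }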

I'm going to add some logging in the next release of maxwell to at least dump the binlog position of the bizarre unhandled "BEGIN" messages. Is there any other information that could help us here?

also, getting you to run maxwell with debug logging on might be a big help too. not sure but maybe.

jkrcma commented 5 years ago

The only reasonable idea I can come up with, after inspecting these issues all over again and not seeing any pattern, is a race condition somewhere inside the replicator. Since my last comment on Saturday only one GTID gap has appeared on our bigger instance, correlating with pt-table-checksum, which has been running for 2 days already. The smaller instance is left with no gaps, which is really good news for the ecosystem connected to the CDC!

Based on the above I am not sure whether dumping the binlog position at the unhandled events would be of any help here. Also, I'm afraid that enabling debug logging would fry our logging cluster (ELK), as the volume of messages would be simply huge.

Thanks for your kind help so far!

osheroff commented 5 years ago

Glad to hear the problem is getting better! I trust your gut; I'd still love to figure out exactly how a session that flipped between statement-based and row-based logging might cause Maxwell to go out to lunch. Could you dump the binlog around one of these events?

If your theory is correct, you should see maxwell output something like:

and then the GTID gap commences?

jkrcma commented 5 years ago

@osheroff I wish my guess were correct. It may be related somehow, but sadly there is still a high chance of a statistical fluke :( I didn't find any of the statements pt-table-checksum uses in the binlogs near the edges of the GTID gap. And in my testing on a smaller dev server I wasn't able to break the binlog of a not-yet-committed transaction by switching binlog_format, executing some non-deterministic updates, and committing later.

But I think I have something new for you. It might be useful, it might not, or it may inspire you to add some extra valuable output to the exception messages in future releases. I spent a couple of hours today analyzing it and wrote down my observations:

Deserialization failure (reported as #1133)

15:13:28,516 WARN  BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.event.deserialization.EventDataDeserializationException: Failed to deserialize data of EventHeaderV4{timestamp=1542121629000, eventType=EXT_UPDATE_ROWS, serverId=17, headerLength=19, dataLength=291, nextPosition=946876637, flags=0}

This is where the problem begins. Based on the nextPosition key I traced it down to a single UPDATE event on one table, inside a big transaction marked by GTID 00003306-1111-0000-0000-000000000004:528661863. The transaction is a batch update of products imported from a partner; you can imagine a loop which commits big batches of changed data at once. The transaction sits somewhere in the middle of the binlog file.

Decoded binlog (sensitive data filtered out and shortened):

#181113 16:07:09 server id 17  end_log_pos 946872207 CRC32 0x3744547b   GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661861'/*!*/;
# at 946872207
#181113 16:07:09 server id 17  end_log_pos 946872270 CRC32 0xd9d2f693   Query   thread_id=445618740     exec_time=0     error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN
/*!*/;
# at 946872270
#181113 16:07:09 server id 17  end_log_pos 946872411 CRC32 0x751a3fc9   Table_map: `xxxdb`.`xxxDO` mapped to number 10919
# at 946872411
#181113 16:07:09 server id 17  end_log_pos 946872596 CRC32 0xc27ad432   Write_rows: table id 10919 flags: STMT_END_F
# at 946872596
#181113 16:07:09 server id 17  end_log_pos 946872680 CRC32 0x3a9bd458   Table_map: `xxxdb`.`xxxQ2PI` mapped to number 11119
# at 946872680
#181113 16:07:09 server id 17  end_log_pos 946872755 CRC32 0xa4911e60   Write_rows: table id 11119 flags: STMT_END_F
# at 946872755
#181113 16:07:09 server id 17  end_log_pos 946872786 CRC32 0x4e4cfaff   Xid = 18973713676
COMMIT/*!*/;
# at 946872786
#181113 16:07:09 server id 17  end_log_pos 946872834 CRC32 0x811d1a2e   GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661862'/*!*/;
# at 946872834
#181113 16:07:09 server id 17  end_log_pos 946872897 CRC32 0x85162e6e   Query   thread_id=445686711     exec_time=0     error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN
/*!*/;
# at 946872897
#181113 16:07:09 server id 17  end_log_pos 946872960 CRC32 0x56ece1fc   Table_map: `xxxdb`.`xxxU` mapped to number 10741
# at 946872960
#181113 16:07:09 server id 17  end_log_pos 946873026 CRC32 0x21c1ac31   Write_rows: table id 10741 flags: STMT_END_F
# at 946873026
#181113 16:07:09 server id 17  end_log_pos 946873057 CRC32 0xae2a6028   Xid = 18973713795
COMMIT/*!*/;
# at 946873057
#181113 16:07:09 server id 17  end_log_pos 946873105 CRC32 0x8c814ddb   GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661863'/*!*/;
# at 946873105
#181113 16:07:09 server id 17  end_log_pos 946873168 CRC32 0xb80ee7aa   Query   thread_id=445460466     exec_time=0     error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN 
/*!*/;

...

# at 946873590
#181113 16:07:09 server id 17  end_log_pos 946873657 CRC32 0xe66ccaad   Table_map: `xxxdb`.`xxxAVS` mapped to number 10798
# at 946874541
#181113 16:07:09 server id 17  end_log_pos 946874609 CRC32 0x7288d003   Table_map: `xxxdb`.`xxxPIS` mapped to number 10811
# at 946874678
#181113 16:07:09 server id 17  end_log_pos 946874775 CRC32 0x2c0abb71   Table_map: `xxxdb`.`xxxP` mapped to number 10799
# at 946875550
#181113 16:07:09 server id 17  end_log_pos 946875632 CRC32 0xb813b5ea   Table_map: `xxxdb`.`xxxIPV` mapped to number 10749
# at 946875854
#181113 16:07:09 server id 17  end_log_pos 946875943 CRC32 0x385b4b25   Table_map: `xxxdb`.`xxxIP` mapped to number 10752
# at 946875943
#181113 16:07:09 server id 17  end_log_pos 946876327 CRC32 0x61679e2d   Update_rows: table id 10752
# at 946876327
#181113 16:07:09 server id 17  end_log_pos 946876637 CRC32 0x33fef936   Update_rows: table id 10749

!!!

# at 946876637
#181113 16:07:09 server id 17  end_log_pos 946877111 CRC32 0x2662fb1b   Update_rows: table id 10799

!!!

# at 946877111
#181113 16:07:09 server id 17  end_log_pos 946877199 CRC32 0xe6ce0403   Update_rows: table id 10798
# at 946877199
#181113 16:07:09 server id 17  end_log_pos 946877263 CRC32 0xe9e3bbfb   Update_rows: table id 10811
# at 946877263
#181113 16:07:09 server id 17  end_log_pos 946877304 CRC32 0x6093593a   Write_rows: table id 11115 flags: STMT_END_F

...

Notice the area marked by !!!. Similar-looking queries are executed hundreds of times inside this single transaction, but this particular one, which failed to deserialize, was the first of its kind in the whole transaction.

Unhandled event

Follow-up from the Maxwell logs:

15:13:28,521 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
15:13:28,615 WARN  BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-0000-0000-000000000001:1-4
8861335,00003306-1111-0000-0000-000000000002:1-33555,00003306-1111-0000-0000-000000000003:1-2752804977,00003306-1111-00
00-0000-000000000004:1-528661863,00003306-1111-1111-1111-111111111111:1-11000155952,00003306-2222-2222-2222-22222222222
2:1-8706021957 -- restarting
15:13:28,709 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000001:1-48861335,
00003306-1111-0000-0000-000000000002:1-33555,00003306-1111-0000-0000-000000000003:1-2752804977,00003306-1111-0000-0000-
000000000004:1-528661863,00003306-1111-1111-1111-111111111111:1-11000155952,00003306-2222-2222-2222-222222222222:1-8706
021957 (sid:6379, cid:101593611)
15:13:28,709 INFO  BinlogConnectorLifecycleListener - Binlog connected.
15:13:28,755 WARN  BinlogConnectorReplicator - Unhandled QueryEvent inside transaction: QueryEventData{threadId=445687821, executionTime=0, errorCode=0, database='', sql='BEGIN'}

Notice the timestamps of the log rows. The binlog reconnects right after the deserialization exception, roughly 200 ms later. The GTID set is probably corrupted at this point, because it already contains the transaction ID where the exception was thrown (bug?).

Immediately after reconnecting, the unhandled QueryEvent is captured. The threadId field gives me a small hint here, but not much. It executes the first transaction of the next binlog, whose transaction ID, as we already know, is very suspicious:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#181113 16:08:33 server id 8  end_log_pos 120 CRC32 0x23958fb2  Start: binlog v 4, server v 5.6.41-84.1-log created 181
113 16:08:33 
# at 120
#181113 16:08:33 server id 8  end_log_pos 391 CRC32 0x13d33342  Previous-GTIDs
# 00003306-1111-0000-0000-000000000001:1-48861335,
# 00003306-1111-0000-0000-000000000002:1-33555,
# 00003306-1111-0000-0000-000000000003:1-2752804977,
# 00003306-1111-0000-0000-000000000004:1-528671688,
# 00003306-1111-1111-1111-111111111111:1-11000155952,
# 00003306-2222-2222-2222-222222222222:1-8706021957
# at 391
#181113 16:08:33 server id 17  end_log_pos 439 CRC32 0x0defbac9         GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528671689'/*!*/; 
# at 439
#181113 16:08:33 server id 17  end_log_pos 502 CRC32 0x5230e608         Query   thread_id=445687821     exec_time=0
     error_code=0
SET TIMESTAMP=1542121713/*!*/;
SET @@session.pseudo_thread_id=445687821/*!*/; 
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=524288/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/; 
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
...

The GTID 00003306-1111-0000-0000-000000000004:528671689 is the edge of the newly created gap, but this particular transaction itself is processed correctly.

The thread with threadId=445687821 executed another 83 queries in the new binlog, ending at ID 528672781, and another 353 queries in the old binlog, starting at ID 528664022 and ending at ID 528671671.

Conclusion

This leads me to two questions:

  1. What kind of error happens in event deserialization? If I can turn on more verbose output of mysqlbinlog for you somehow, just tell me!
  2. How could over 10,000 transactions possibly be processed in 239 ms? They can't be, so I believe the whole binlog is simply skipped after the deserialization failure occurs.

Hope I brought some fresh ideas to you :)

osheroff commented 5 years ago

These are indeed some good new ideas. I think you've got a handle on what's happening. Let's see if I can summarize:

  1. BinlogConnector has already marked this transaction as processed and advanced the GTID set.
  2. The transaction buffer still contains rows from the previous transaction.
  3. maxwell reconnects at the next transaction (B), leading to the "unhandled query event" errors.
  4. maxwell probably outputs a partial buffer of some of the rows of TX A and the rest of TX B.

Now I'm not sure why this would cause the rest of the file to go missing, that doesn't make much sense.

What kind of error happens in event deserialization? If I can turn on more verbose output of mysqlbinlog for you somehow, just tell me!

Unfortunately there's a ton of ways that event deserialization can fail; it's thrown here:

https://github.com/shyiko/mysql-binlog-connector-java/blob/f698427b151aadacba8e6ec305dbc6b5c9d61aee/src/main/java/com/github/shyiko/mysql/binlog/event/deserialization/EventDeserializer.java#L242

In response to an IOException bubbling up from one of a jillion spots in the codebase. And more unfortunately, maxwell is losing the IOException that caused it because it's not logging getCause().

I'm going to immediately add logging for the exception, and can also start working on a fix, especially if the cause of the deserialization failure is just a lost network connection; in that case the fix should be to throw out the current transaction buffer and rewind the GTID position to before the transaction.
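Roughly the shape I have in mind, as a sketch only (hypothetical class and method names, not Maxwell's actual replicator code): buffer the transaction, fold its GTID into the processed set only on COMMIT, and on a mid-transaction failure drop the buffer and rewind the client so the reconnect replays the whole transaction instead of skipping past it.

    import com.github.shyiko.mysql.binlog.BinaryLogClient;

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only; hypothetical names, not Maxwell's actual code.
    class GtidTransactionBuffer {
        private final List<Object> bufferedRows = new ArrayList<>();
        private String committedGtidSet;  // GTID set as of the last COMMIT we fully processed
        private String inFlightGtid;      // GTID of the transaction currently being buffered

        GtidTransactionBuffer(String initialGtidSet) {
            this.committedGtidSet = initialGtidSet;
        }

        void beginTransaction(String gtid) {
            inFlightGtid = gtid;
            bufferedRows.clear();
        }

        void addRow(Object row) {
            bufferedRows.add(row);
        }

        void commit() {
            // ship bufferedRows downstream here, then advance the committed set;
            // naive merge for illustration -- a real implementation must merge
            // interval ranges per server UUID
            committedGtidSet = committedGtidSet + "," + inFlightGtid;
            bufferedRows.clear();
            inFlightGtid = null;
        }

        void onCommunicationFailure(BinaryLogClient client) {
            // drop the partial transaction and rewind, so the reconnect replays it
            bufferedRows.clear();
            inFlightGtid = null;
            client.setGtidSet(committedGtidSet);
        }
    }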

jkrcma commented 5 years ago

Now I'm not sure why this would cause the rest of the file to go missing, that doesn't make much sense.

Well, today I noticed that Maxwell got stuck yesterday evening, eating RAM for a couple of hours and then just GC'ing for the rest of the night, without a word in the log. I think there are many other things which don't make sense, but we need to proceed slowly and gradually in order to fix the most bothersome issues first :)

Every time I see such a global catch across a codebase I smell problems. Of course it might simply be network issues; considering the amount of data transferred, there is some small percentage of lost packets caused by the mighty power of chaos. For example, we already had to deal with k8s DNS lookup issues caused by exactly this kind of natural packet loss.

Once the new version is out I'll give it a try and let you know! These gaps have really been a big issue for us lately, especially before Xmas, as our business is mostly e-commerce. Thank you!

osheroff commented 5 years ago

1.19.2 includes more logging about the cause of the deserialization errors....

liulikun commented 5 years ago

To my knowledge, each end_log_pos should match the position after "# at" on the next line. In the snippet @jkrcma provided, there is a section of binlog events not following this pattern, for example end_log_pos 946873657 (assuming this section is unedited), which seems alarming.

jkrcma commented 5 years ago

@osheroff Excellent, I'm going to deploy the version to the production env today.

@liulikun The positions in the log snippets are unedited; I only obfuscated the database and table names.

jkrcma commented 5 years ago

So here we go, from last night, starting directly with this log entry:

01:49:14,865 INFO  BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
01:49:15,931 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
01:49:16,016 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-166980806
96,00003306-2222-2222-2222-222222222222:1-11339872796 (sid:6379, cid:793600206)
01:49:16,016 INFO  BinlogConnectorLifecycleListener - Binlog connected.
01:49:16,061 WARN  BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.network.ServerException: could not find next log; the first event 'servername-bin.030628' at 464942570, the last event read from '/data2/mysql_binlog/servername-bin.030628' at 464942636, the last byte read from '/data2/mysql_binlog/log-bin.030628' at 464942636.
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:882) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
01:49:16,068 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
01:49:16,160 WARN  BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796 -- restarting
...

This log snippet is repeated over and over again; in between the repetitions I could spot the unhandled QueryEvent:

01:52:02,084 WARN  BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-1
6698080696,00003306-2222-2222-2222-222222222222:1-11339872796 -- restarting
01:52:02,163 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-166980806
96,00003306-2222-2222-2222-222222222222:1-11339872796 (sid:6379, cid:793615882)
01:52:02,164 INFO  BinlogConnectorLifecycleListener - Binlog connected.
01:52:02,295 WARN  BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030629:279[00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11339927543] inside transaction: QueryEventData{threadId=2203765669, executionTime=1, errorCode=0, database='', sql='BEGIN'}
02:04:16,105 INFO  BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306

From then on, the position reported in the exception changes as the binlog number increases by two. The loop continues until another identical-looking unhandled QueryEvent is thrown at 02:13:16,274; the only difference is an executionTime of 52, which is a lot for a BEGIN :) Then the deadly loop stops.

At 02:24:08,461 a schema change is triggered by one of our cron jobs.

02:25:21,372 WARN  BinlogConnectorLifecycleListener - Communication failure.
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_181]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_181]
        at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:51) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:202) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:184) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readLong(ByteArrayInputStream.java:57) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventHeaderV4Deserializer.deserialize(EventHeaderV4Deserializer.java:34) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventHeaderV4Deserializer.deserialize(EventHeaderV4Deserializer.java:27) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:196) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:890) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
02:25:21,372 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.

(I guess this is what we've been looking for)

Another unhandled QueryEvent is thrown immediately upon reconnection, followed by another storm:

02:25:21,592 WARN  BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030634:279[00003306-1111-1111-1111-1111
11111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-113400806
38:11340164528-11340266567:11340285788-11340285788] inside transaction: QueryEventData{threadId=2203959941, executionTi
me=9, errorCode=0, database='', sql='BEGIN'}
02:56:28,377 WARN  BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.event.deserialization.EventDataDeserializationException: Failed to deserialize data of EventHeaderV4{timestamp=1543892125000, eventType=EXT_UPDATE_ROWS, serverId=4, headerLength=19, dataLength=1243, nextPosition=1046190134, flags=0}
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserialize
r.java:233) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:201)
 ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:890) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.io.EOFException
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:190) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at java.io.InputStream.read(InputStream.java:170) ~[?:1.8.0_181]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:96) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:89) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeVarString(AbstractRowsEventDataDeserializer.java:376) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:184) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:138) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserializeRows(UpdateRowsEventDataDeserializer.java:72) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:58) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:33) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:227) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
        ... 5 more
02:56:28,378 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
02:56:28,477 WARN  BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432 -- restarting
02:56:28,561 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432 (sid:6379, cid:793993642)
02:56:28,561 INFO  BinlogConnectorLifecycleListener - Binlog connected.
02:56:28,601 WARN  BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030638:279[00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432:11340643712-11340643712] inside transaction: QueryEventData{threadId=2204179849, executionTime=1, errorCode=0, database='', sql='BEGIN'}

(Caused by: java.io.EOFException)

Between 04:06:00 and 06:28:00 there are around 10 "storing schema ..." messages triggered by our very complicated cron jobs, but Maxwell is still running and replicating. Just... mind the gaps :)

osheroff commented 5 years ago

Is it possible for you to sanitize and dump the full log somewhere?

I think maybe this is one bug manifesting in two different ways. The bug is that when maxwell's connection is lost in the middle of processing a GTID transaction, the GTID pointer has already marked that transaction as processed. We then erroneously attempt to reconnect (in the middle of the transaction processing code, also bad) past the end of the transaction. This would explain the "could not find next log" errors as well as the gaps.

I'm working on a fix for this issue; it's quite hard to create a reliable test case/repro, but I'm close.

In the meantime, I'd ask: why, do you think, is your maxwell->mysql connection so unstable? I can see a connection dropped by a router here and there, but losing your CX multiple times an hour seems excessive....

jkrcma commented 5 years ago

@osheroff Yes it is, preferably over a private channel; give me a hint as to where we can exchange the data. Great news anyway!

As for the connection problems, we've investigated issues related to Kubernetes connectivity multiple times. There is a known SNAT/DNAT issue with current Linux kernels which is a pain in the ass at connection time, but I don't think that's the case here?

I could try to run Maxwell elsewhere, completely avoiding k8s, VLANs, and the various virtual routers in the path; maybe it's worth testing. Otherwise we don't have a clue. Our Cisco Nexus switches don't report any unusual numbers in their ethernet counters.

osheroff commented 5 years ago

@jkrcma my email is ben dot osheroff at gmail dot com, hit me up...

I'd definitely think about bypassing the kubernetes bits. The only other thing I can think of is a rogue process killer on the mysql server side.

osheroff commented 5 years ago

FYI some of this is going upstream... https://github.com/shyiko/mysql-binlog-connector-java/pull/250

jkrcma commented 5 years ago

@osheroff I'll send you the log output once Maxwell crashes again; I had to start from scratch yesterday evening and forgot to save the complete log output.

I've been running a separate Maxwell instance for a couple of days on a different node, in Docker within the same VLAN as our DB nodes, and so far there are no gaps at all. So you were probably right, huh? :)

A process killer is present on our nodes, but trust me, it was one of the first dark places I checked.

osheroff commented 5 years ago

hey @jkrcma any problems since you fixed the network config?

jkrcma commented 5 years ago

@osheroff Happy new year! I'm not sure whether the packet drops are 100% fixed, but as far as I know the Maxwell instances ran without any issues during the Xmas holidays. Then on January 1st one k8s node crashed and the routing became funky for some time, and a lot of gaps were generated :]

osheroff commented 5 years ago

ok 1.19.5 has fixes for the issues raised here. Thanks @jkrcma for sticking with it; you have no idea how many folks cruise by with an issue but don't stay to help debug. Let me know how it works for you!

jkrcma commented 5 years ago

And I did check yesterday whether the new version was out yet :) Actually, I have an idea what it feels like, so I sympathize; helping you was one of the goals of my work. I updated the chart and deployed all instances, purging the current state. So we will see; the network issues in k8s are still present though, probably due to the same Linux NAT issues.

In the meantime I found another small bug, I'll file an issue later :) Thank you so much for your work!

jkrcma commented 5 years ago

Looks like we are not done yet :(

One instance apparently handled the event correctly (no unhandled QueryEvent warning), yet a GTID gap is still being created:

12:05:42,075 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000002:1-52969974,... (sid:6379, cid:12747637)
12:05:42,075 INFO  BinlogConnectorLifecycleListener - Binlog connected.
17:09:42,129 INFO  BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
17:10:14,394 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
17:10:14,483 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000002:1-52969974:53047682-55831299,... (sid:6379, cid:15173908)

On the other hand, the second instance (heavier throughput) still caught the same error as before:

16:50:23,904 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
16:50:23,995 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-17075273146,... (sid:6379, cid:832736)
16:50:23,995 INFO  BinlogConnectorLifecycleListener - Binlog connected.
16:50:24,039 WARN  BinlogConnectorReplicator - Unhandled QueryEvent @ mysql-bin.012109:279[00003306-1111-1111-1111-111111111111:1-17075273146,...] inside transaction: QueryEventData{threadId=2377788367, executionTime=917, errorCode=0, database='', sql='BEGIN'}
16:54:23,996 INFO  BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
16:54:34,380 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
16:54:34,470 INFO  BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-17075273146:17075346085-17075346899,... (sid:6379, cid:833176)
16:54:34,471 INFO  BinlogConnectorLifecycleListener - Binlog connected.
16:54:34,515 WARN  BinlogConnectorReplicator - Unhandled QueryEvent @ mysql-bin.012110:279[00003306-1111-1111-1111-111111111111:1-17075273146:17075346085-17075346899,...] inside transaction: QueryEventData{threadId=2377832054, executionTime=900, errorCode=0, database='', sql='BEGIN'}

I copied the binlogs for later analysis tomorrow.

jkrcma commented 5 years ago

Looking at trx 00003306-1111-1111-1111-111111111111:1-17075273146, it is again a big chunk of updates going from log pos 505638485 to 506392595, and trx ...147 is also big, pos 506392643 to 506412071. It's a pity the logfile position isn't printed in the logs, so we don't know exactly where the unhandled event occurred. I'm afraid the fix you implemented didn't cover some edge cases.

Upon reconnecting, Maxwell starts at the next binlog, creating the gap, which I believe is a safety measure to avoid unexpected behavior.

osheroff commented 5 years ago

Interesting. There's a different pattern... here, the BinaryLogClient keepAlive thread is reconnecting us, and maxwell doesn't even really notice it's happening. When I was working on this code, I noticed we had to explicitly clear out the binlog filename and offset in order to reconnect properly (see https://github.com/zendesk/maxwell/pull/1186/files#diff-6c0364725d8f72f2d70f68c6c6115747R339).

Let's try disabling the native keepAlive thread. Here's a quick release with the BinaryLogClient's keepAlive thread shut off: https://www.dropbox.com/s/1x2qko01n3obpqf/maxwell-1.19.5.1.tar.gz?dl=0. I think this will help with the UnhandledQueryEvent messages.
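If you want to make the same change in your own build, the connector exposes this directly; presumably the quick release just does something along these lines when constructing the client (host, credentials, and the GTID set are placeholders):

    import com.github.shyiko.mysql.binlog.BinaryLogClient;

    public class NoKeepAliveClient {
        public static void main(String[] args) throws Exception {
            // Placeholders; in Maxwell the GTID set comes from the stored position.
            BinaryLogClient client = new BinaryLogClient("1.2.3.4", 3306, "maxwell", "secret");
            client.setGtidSet("00003306-1111-1111-1111-111111111111:1-16698080696");
            // Disable the connector's own keepAlive/reconnect thread so the caller
            // is the only thing that ever decides how and where to reconnect.
            client.setKeepAlive(false);
            client.registerEventListener(event -> System.out.println(event.getHeader().getEventType()));
            client.connect(); // blocks
        }
    }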

I'm less sure that this will fix the GTID gaps like the first client saw... was the connection message at 12:05:42,075 also part of a reconnect or was that the first connection that client had made?

jkrcma commented 5 years ago

Also, I forgot to mention that trx 17075273146 was actually found in binlog 012108, contrary to what the log message says; 012109 begins with trx 17075346084.

The message was part of a reconnect. Our logs contain a lot of schema changes detected on a daily basis, so I didn't want to bother you with those and the other unremarkable reconnects which happened before.

I'm going to deploy the quick release and let you know!

osheroff commented 5 years ago

Yes. This is actually starting to make sense in my head. I think.

We're inside a large GTID transaction. The keep-alive thread detects a dead CX and reconnects us, but it doesn't do it in a safe, GTID-friendly way; instead it passes a full tuple of binlog filename, binlog position, and GTID. MySQL sees this and utterly freaks out (why? I do not know) and starts replication at the next log file. This explains the gaps that start with the next file, as well as Maxwell getting unexpected and odd events.
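To make the contrast concrete, here's roughly the difference in connector terms (a sketch, not Maxwell's actual reconnect code; the position-4 convention just means "start of a binlog file, right after the magic header"):

    import com.github.shyiko.mysql.binlog.BinaryLogClient;

    // Sketch of the two reconnect styles.
    class ReconnectStyles {
        // What the keepAlive thread effectively resumes with: filename + offset + GTID set.
        // If the filename/offset no longer lines up with the GTID set (we died mid-transaction),
        // the server can refuse the position or jump to the next log file.
        static void resumeWithFullTuple(BinaryLogClient client, String file, long pos, String gtids) {
            client.setBinlogFilename(file);
            client.setBinlogPosition(pos);
            client.setGtidSet(gtids);
        }

        // A GTID-only resume: clear the file/offset and let the server pick the
        // right binlog purely from the executed-GTID set.
        static void resumeWithGtidOnly(BinaryLogClient client, String gtids) {
            client.setBinlogFilename("");
            client.setBinlogPosition(4);
            client.setGtidSet(gtids);
        }
    }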

Benny the bookie lays 5 to 1 that turning off the keepalive thread fixes the whole ball of wax.

jkrcma commented 5 years ago

Ahh that sounds logical indeed. Maxwell is deployed so results should be noticeable in 24 hours. Thank you very much!

jkrcma commented 5 years ago

@osheroff It's working!!! I haven't seen a single unhandled QueryEvent despite there being over 9 connection drops since I deployed it.

Edit after 24 hours: still not a single gap, nor any of the weird loops with missing binlogs and similar errors. I believe this fixed those as well.

osheroff commented 5 years ago

Fuckin' awesome. I've added this to mainline in v1.19.6; this also needs to go upstream to open-replicator, but it's fixed in maxwell for now. Thanks again for the dedication @jkrcma