is there any logging information before the unhandled QueryEvent? wondering if there's a way we could locate where in the binlog maxwell is failing; unhandled query event seems to indicate that maxwell missed the COMMIT of the previous transaction somehow, but I'm... really not sure quite how. getting a SHOW BINLOG EVENTS around the relevant section of the binlog could help.
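e.g. (just a sketch; the file name and offset below are placeholders):
SHOW BINARY LOGS;                                          -- find the file in question
SHOW BINLOG EVENTS IN 'mysql-bin.000123' FROM 4 LIMIT 50;  -- events at/after a given offset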
There is no information before the event. I located the particular transaction using the mysqlbinlog tool but didn't see anything eye-catching besides the BEGIN statement itself; I didn't even find any similarity in the transaction contents. Should I look backwards to check the previous statements?
yeah, I'm more interested in the statement directly before the BEGIN that triggered the unhandled query event
Assuming that the transaction before is the one at the edge of the GTID interval, then we are talking about a huuuuge transaction in today's case. There was definitely a big batch of large transactions going on during the few minutes before the unhandled event happened. At that time there was also a binary backup running on the same node, causing a bit more replication delay.
I've isolated the transaction into a text file, but it looks complete, including the COMMIT statement. Rough stats are:
114 inserts
284 updates
497 deletes
The transaction before the GTID edge is a small single-statement UPDATE query.
I’m really shooting in the dark here, but I’m wondering if the issue with missing the COMMIT to that transaction is somehow correlated with the other GTID issue you’re seeing where it appears to ignore binlog events. Are the GTID numbers in the problematic section of the binlog strictly monotonic?
Also, just to confirm, is this what you’re seeing?
Transaction X arrives safe and sound at Kinesis.
Transaction X+1 is never sent.
Transaction X+2 triggers the "unhandled query event" message.
Correct? Pinging @shyko to see if he has any GTID insight here
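(Something like this, a sketch with a placeholder UUID and range standing in for the GTID set maxwell last logged, would list everything the server executed that maxwell never acknowledged:)
SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,
                     '00000000-0000-0000-0000-000000000000:1-100') AS not_acked_by_maxwell;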
You mean that the replicator could get stuck expecting a COMMIT at a certain position which never arrives? But the other issue is observable when I restart Maxwell to re-replicate the missing transactions of the gap (which is, by the way, a bad idea because replication isn't deterministic?).
The issue hasn't happened since a week ago and the binlogs from that time have been erased, so I can't do any research now, just my luck :) I am not sure how to recognize whether transaction X+1 is sent or not, but if our binlogs were not monotonic then we would have much bigger problems, I believe.
Any ideas what to try? I am able to put one of our real slaves into maintenance mode and play with it arbitrarily if needed.
Something to try would be to see if it's a problem that's reliably reproduced with the binlog in question. You could use a second instance of maxwell with a combination of --replay and --init_position to try to reproduce the issue.
Sorry for the longer delay, I had other priorities at work. I found something which could be related to this issue. Do you know how well Maxwell or the binlog connector can handle switching binlog_format to STATEMENT and then back to ROW inside a session?
I was refactoring our own implementation of slave lag metrics, which was created before pt-heartbeat was introduced, and I also removed an unused table which was supposed to store the MySQL thread loads across the cluster. There is an event running every second, and inside this event there was a cleanup query to ensure the loads table doesn't grow too big. It looked like this:
....
SET SESSION binlog_format = 'STATEMENT';
DELETE FROM maintenance.mysqld_load WHERE `timestamp` < (UTC_TIMESTAMP() - INTERVAL 30 SECOND);
SET SESSION binlog_format = 'ROW';
Since I altered the event, the number of issues with GTID gaps has dropped to nearly zero. Considering that our services are facing increased traffic due to the shopping season, the chance of a random coincidence is getting lower every day.
However, there are still a few gaps appearing. I found out that pt-table-checksum also uses this trick. We run the checksum once a week.
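In case it helps with reproducing, the switch itself is session-scoped only; a quick sketch of what I mean (the global format stays ROW throughout):
SELECT @@GLOBAL.binlog_format, @@SESSION.binlog_format;  -- both ROW before the switch
SET SESSION binlog_format = 'STATEMENT';
SELECT @@GLOBAL.binlog_format, @@SESSION.binlog_format;  -- global stays ROW, session is STATEMENT
SET SESSION binlog_format = 'ROW';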
Tested it out locally, maxwell didn't seem to have a problem; it obviously ignores the STATEMENT based events, but then it seems to switch back normally. I'm obviously not discounting this as a cause, as it seems related, just that I couldn't repro locally.
I'm going to add some logging in the next release of maxwell to at least dump the binlog position of the bizarre unhandled "BEGIN" messages. Is there any other information that could help us here?
also, getting you to run maxwell with debug logging on might be a big help too. not sure but maybe.
The only reasonable idea I keep coming back to while inspecting these issues all over again and not seeing any pattern is a race condition somewhere inside the replicator. Since my last comment on Saturday only one GTID gap appeared on our bigger instance, correlating with pt-table-checksum, which has been running for 2 days already. The smaller instance is left with no gaps, which is really good to see for the ecosystem connected to the CDC!
Based on the above I am not sure whether dumping the binlog position at the unhandled events helps here. Also I'm afraid that enabling debug logging would fry our logging cluster (ELK), as the amount of messages would be simply huuuge.
Thanks for your kind help so far!
Glad to hear the problem is getting better! I trust your gut; I'd love however to figure out exactly how having a session that flipped between statement-based and row-based might cause maxwell to go out to lunch. Could you dump the binlog around one of these events?
If your theory is correct, you should see maxwell output something like:
and then the GTID gap commences?
@osheroff I wish my guess were correct. It may be related somehow, but there is still a high chance of a statistical error, sadly :( I didn't find any of the statements pt-table-checksum uses in the binlogs near the edges of the GTID gap. And from my testing on a smaller dev server I wasn't able to break the binlog of a not-yet-committed query by switching the binlog_format and executing some non-deterministic updates and committing later.
But I think I have something new for you, which might be useful, might not be, or may inspire you to add some extra valuable output to the exception messages for future releases. I spent a couple of hours today analyzing it and wrote down notes from my observations:
15:13:28,516 WARN BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.event.deserialization.EventDataDeserializationException: Failed to deserialize data of EventHeaderV4{timestamp=1542121629000, eventType=EXT_UPDATE_ROWS, serverId=17, headerLength=19, dataLength=291, nextPosition=946876637, flags=0}
This is where the problem begins. Based on the nextPosition key I traced down a single UPDATE event on one table, inside a big transaction marked by GTID 00003306-1111-0000-0000-000000000004:528661863. The transaction is a batch update of products imported from a partner; you can imagine a cycle which commits big batches of changed data at once. This transaction is stored somewhere in the middle of the binlog file.
Decoded binlog (sensitive data filtered out and shortened):
#181113 16:07:09 server id 17 end_log_pos 946872207 CRC32 0x3744547b GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661861'/*!*/;
# at 946872207
#181113 16:07:09 server id 17 end_log_pos 946872270 CRC32 0xd9d2f693 Query thread_id=445618740 exec_time=0 error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN
/*!*/;
# at 946872270
#181113 16:07:09 server id 17 end_log_pos 946872411 CRC32 0x751a3fc9 Table_map: `xxxdb`.`xxxDO` mapped to number 10919
# at 946872411
#181113 16:07:09 server id 17 end_log_pos 946872596 CRC32 0xc27ad432 Write_rows: table id 10919 flags: STMT_END_F
# at 946872596
#181113 16:07:09 server id 17 end_log_pos 946872680 CRC32 0x3a9bd458 Table_map: `xxxdb`.`xxxQ2PI` mapped to number 11119
# at 946872680
#181113 16:07:09 server id 17 end_log_pos 946872755 CRC32 0xa4911e60 Write_rows: table id 11119 flags: STMT_END_F
# at 946872755
#181113 16:07:09 server id 17 end_log_pos 946872786 CRC32 0x4e4cfaff Xid = 18973713676
COMMIT/*!*/;
# at 946872786
#181113 16:07:09 server id 17 end_log_pos 946872834 CRC32 0x811d1a2e GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661862'/*!*/;
# at 946872834
#181113 16:07:09 server id 17 end_log_pos 946872897 CRC32 0x85162e6e Query thread_id=445686711 exec_time=0 error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN
/*!*/;
# at 946872897
#181113 16:07:09 server id 17 end_log_pos 946872960 CRC32 0x56ece1fc Table_map: `xxxdb`.`xxxU` mapped to number 10741
# at 946872960
#181113 16:07:09 server id 17 end_log_pos 946873026 CRC32 0x21c1ac31 Write_rows: table id 10741 flags: STMT_END_F
# at 946873026
#181113 16:07:09 server id 17 end_log_pos 946873057 CRC32 0xae2a6028 Xid = 18973713795
COMMIT/*!*/;
# at 946873057
#181113 16:07:09 server id 17 end_log_pos 946873105 CRC32 0x8c814ddb GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528661863'/*!*/;
# at 946873105
#181113 16:07:09 server id 17 end_log_pos 946873168 CRC32 0xb80ee7aa Query thread_id=445460466 exec_time=0 error_code=0
SET TIMESTAMP=1542121629/*!*/;
BEGIN
/*!*/;
...
# at 946873590
#181113 16:07:09 server id 17 end_log_pos 946873657 CRC32 0xe66ccaad Table_map: `xxxdb`.`xxxAVS` mapped to number 10798
# at 946874541
#181113 16:07:09 server id 17 end_log_pos 946874609 CRC32 0x7288d003 Table_map: `xxxdb`.`xxxPIS` mapped to number 10811
# at 946874678
#181113 16:07:09 server id 17 end_log_pos 946874775 CRC32 0x2c0abb71 Table_map: `xxxdb`.`xxxP` mapped to number 10799
# at 946875550
#181113 16:07:09 server id 17 end_log_pos 946875632 CRC32 0xb813b5ea Table_map: `xxxdb`.`xxxIPV` mapped to number 10749
# at 946875854
#181113 16:07:09 server id 17 end_log_pos 946875943 CRC32 0x385b4b25 Table_map: `xxxdb`.`xxxIP` mapped to number 10752
# at 946875943
#181113 16:07:09 server id 17 end_log_pos 946876327 CRC32 0x61679e2d Update_rows: table id 10752
# at 946876327
#181113 16:07:09 server id 17 end_log_pos 946876637 CRC32 0x33fef936 Update_rows: table id 10749
!!!
# at 946876637
#181113 16:07:09 server id 17 end_log_pos 946877111 CRC32 0x2662fb1b Update_rows: table id 10799
!!!
# at 946877111
#181113 16:07:09 server id 17 end_log_pos 946877199 CRC32 0xe6ce0403 Update_rows: table id 10798
# at 946877199
#181113 16:07:09 server id 17 end_log_pos 946877263 CRC32 0xe9e3bbfb Update_rows: table id 10811
# at 946877263
#181113 16:07:09 server id 17 end_log_pos 946877304 CRC32 0x6093593a Write_rows: table id 11115 flags: STMT_END_F
...
Notice the area marked by !!!. The same-looking queries are executed hundreds of times inside this single transaction. However, this particular one which failed to deserialize was the first of its kind in the whole transaction.
Follow-up of the Maxwell logs:
15:13:28,521 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
15:13:28,615 WARN BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-0000-0000-000000000001:1-4
8861335,00003306-1111-0000-0000-000000000002:1-33555,00003306-1111-0000-0000-000000000003:1-2752804977,00003306-1111-00
00-0000-000000000004:1-528661863,00003306-1111-1111-1111-111111111111:1-11000155952,00003306-2222-2222-2222-22222222222
2:1-8706021957 -- restarting
15:13:28,709 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000001:1-48861335,
00003306-1111-0000-0000-000000000002:1-33555,00003306-1111-0000-0000-000000000003:1-2752804977,00003306-1111-0000-0000-
000000000004:1-528661863,00003306-1111-1111-1111-111111111111:1-11000155952,00003306-2222-2222-2222-222222222222:1-8706
021957 (sid:6379, cid:101593611)
15:13:28,709 INFO BinlogConnectorLifecycleListener - Binlog connected.
15:13:28,755 WARN BinlogConnectorReplicator - Unhandled QueryEvent inside transaction: QueryEventData{threadId=445687821, executionTime=0, errorCode=0, database='', sql='BEGIN'}
Notice the timestamps of the log rows. The binlog reconnects correctly right after the deserialization causes an exception, let's say after 200 ms. The GTID set is probably corrupted at this moment because it contains the transaction ID where the exception was thrown (bug?).
Immediately after reconnecting, the unhandled QueryEvent is captured. The threadId field gives me a little hint here, but not much. It executes the first transaction of the next binlog, whose transaction ID, as we already know, is very suspicious:
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#181113 16:08:33 server id 8 end_log_pos 120 CRC32 0x23958fb2 Start: binlog v 4, server v 5.6.41-84.1-log created 181
113 16:08:33
# at 120
#181113 16:08:33 server id 8 end_log_pos 391 CRC32 0x13d33342 Previous-GTIDs
# 00003306-1111-0000-0000-000000000001:1-48861335,
# 00003306-1111-0000-0000-000000000002:1-33555,
# 00003306-1111-0000-0000-000000000003:1-2752804977,
# 00003306-1111-0000-0000-000000000004:1-528671688,
# 00003306-1111-1111-1111-111111111111:1-11000155952,
# 00003306-2222-2222-2222-222222222222:1-8706021957
# at 391
#181113 16:08:33 server id 17 end_log_pos 439 CRC32 0x0defbac9 GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '00003306-1111-0000-0000-000000000004:528671689'/*!*/;
# at 439
#181113 16:08:33 server id 17 end_log_pos 502 CRC32 0x5230e608 Query thread_id=445687821 exec_time=0
error_code=0
SET TIMESTAMP=1542121713/*!*/;
SET @@session.pseudo_thread_id=445687821/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=524288/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
...
The GTID 00003306-1111-0000-0000-000000000004:528671689 is the edge of a gap which was newly created, but this particular transaction is executed accordingly. The threadId=445687821 executed another 83 queries in the new binlog, ending at ID 528672781, and another 353 queries in the old binlog, starting at ID 528664022 and ending at ID 528671671.
This leads me to two questions: is it a bug that the restart GTID set already contains the failed transaction, and what kind of error happens in event deserialization? If I can turn on more verbose output of mysqlbinlog for you somehow, just tell me!
Hope I brought some fresh ideas to you now :)
these are indeed some good new ideas. I think you've got a handle on what's happening. Let's see if I can summarize: the deserialization failure kills the connection in the middle of a GTID transaction, but the GTID set maxwell reports on reconnect already contains that transaction, so replication resumes past it and a gap appears.
Now I'm not sure why this would cause the rest of the file to go missing, that doesn't make much sense.
What kind of error happens in event deserialization? If I can turn on more verbose output of mysqlbinlog for you somehow, just tell me!
Unfortunately there's a ton of ways that event deserialization can fail; it's thrown in response to an IOException bubbling up from one of a jillion spots in the codebase. And more unfortunately, maxwell is losing the IOException that caused it because it's not logging getCause().
I'm going to immediately add logging for the exception, and can also start working on a fix, especially if the cause of the deserialization failure is just a lost network connection; in that case the fix should be to throw out the current transaction buffer and rewind the GTID point back before the transaction.
Now I'm not sure why this would cause the rest of the file to go missing, that doesn't make much sense.
Well, today I noticed that Maxwell got stuck yesterday evening, draining RAM for a couple of hours and then just GC'ing for the rest of the night, without a word in the log. I think there are many other things which don't make sense, but we need to step slowly and gradually in order to fix the most bothering issues :)
Every time I see such a global catch across the codebase I smell problems. Of course it might simply be network issues; considering the amount of data transferred, there is some small percentage of lost packets caused by the mighty power of chaos. For example, we already had to deal with k8s DNS lookup issues caused exactly by natural packet loss.
Once the new version is out I'll give it a try and let you know! These gaps are really a big issue for us lately, especially before Xmas as our business is mostly e-commerce. Thank you
1.19.2 includes more logging about the cause of the deserialization errors....
To my knowledge, each end_log_pos should match the position after # at on the next line. From the snippet @jkrcma provided, there is a section of binlog events not following this pattern, for example end_log_pos 946873657 (assuming this section is unedited); that seems alarming.
@osheroff Excellent, I'm going to deploy the version to the production env today.
@liulikun The positions in the log snippets are unedited; I obfuscated the database and table names only.
So here we go, from last night. Starting directly with this log entry:
01:49:14,865 INFO BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
01:49:15,931 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
01:49:16,016 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-166980806
96,00003306-2222-2222-2222-222222222222:1-11339872796 (sid:6379, cid:793600206)
01:49:16,016 INFO BinlogConnectorLifecycleListener - Binlog connected.
01:49:16,061 WARN BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.network.ServerException: could not find next log; the first event 'servername-bin.030628' at 464942570, the last event read from '/data2/mysql_binlog/servername-bin.030628' at 464942636, the last byte read from '/data2/mysql_binlog/log-bin.030628' at 464942636.
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:882) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
01:49:16,068 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
01:49:16,160 WARN BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796 -- restarting
...
This log snippet is repeated over and over again; in between these repeats I could spot the unhandled QueryEvent:
01:52:02,084 WARN BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-1
6698080696,00003306-2222-2222-2222-222222222222:1-11339872796 -- restarting
01:52:02,163 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-166980806
96,00003306-2222-2222-2222-222222222222:1-11339872796 (sid:6379, cid:793615882)
01:52:02,164 INFO BinlogConnectorLifecycleListener - Binlog connected.
01:52:02,295 WARN BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030629:279[00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11339927543] inside transaction: QueryEventData{threadId=2203765669, executionTime=1, errorCode=0, database='', sql='BEGIN'}
02:04:16,105 INFO BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
Since then the pos reported in the exception changes as the binlog number increases by two. The loop continues until another same-looking unhandled QueryEvent is thrown at 02:13:16,274. The only difference is the executionTime equal to 52, which is a lot for a BEGIN :) The deadly loop stops.
At 2:24:08,461 a schema change is triggered by one of our crons.
02:25:21,372 WARN BinlogConnectorLifecycleListener - Communication failure.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_181]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_181]
at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:51) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:202) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:184) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readLong(ByteArrayInputStream.java:57) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.EventHeaderV4Deserializer.deserialize(EventHeaderV4Deserializer.java:34) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.EventHeaderV4Deserializer.deserialize(EventHeaderV4Deserializer.java:27) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:196) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:890) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
02:25:21,372 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
(I guess this is what we've been looking for)
Another unhandled QueryEvent is thrown immediately upon reconnection, followed by another storm:
02:25:21,592 WARN BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030634:279[00003306-1111-1111-1111-1111
11111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-113400806
38:11340164528-11340266567:11340285788-11340285788] inside transaction: QueryEventData{threadId=2203959941, executionTi
me=9, errorCode=0, database='', sql='BEGIN'}
02:56:28,377 WARN BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.event.deserialization.EventDataDeserializationException: Failed to deserialize data of EventHeaderV4{timestamp=1543892125000, eventType=EXT_UPDATE_ROWS, serverId=4, headerLength=19, dataLength=1243, nextPosition=1046190134, flags=0}
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserialize
r.java:233) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:201)
~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:890) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.io.EOFException
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:190) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at java.io.InputStream.read(InputStream.java:170) ~[?:1.8.0_181]
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:96) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:89) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeVarString(AbstractRowsEventDataDeserializer.java:376) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:184) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:138) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserializeRows(UpdateRowsEventDataDeserializer.java:72) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:58) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:33) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:227) ~[mysql-binlog-connector-java-0.16.1.jar:0.16.1]
... 5 more
02:56:28,378 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
02:56:28,477 WARN BinlogConnectorReplicator - replicator stopped at position: 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432 -- restarting
02:56:28,561 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432 (sid:6379, cid:793993642)
02:56:28,561 INFO BinlogConnectorLifecycleListener - Binlog connected.
02:56:28,601 WARN BinlogConnectorReplicator - Unhandled QueryEvent @ servername-bin.030638:279[00003306-1111-1111-1111-111111111111:1-16698080696,00003306-2222-2222-2222-222222222222:1-11339872796:11339927543-11340046134:11340078294-11340080638:11340164528-11340266567:11340285788-11340643432:11340643712-11340643712] inside transaction: QueryEventData{threadId=2204179849, executionTime=1, errorCode=0, database='', sql='BEGIN'}
(Caused by: java.io.EOFException)
Between 04:06:00 and 06:28:00 there are around 10 "storing schema ..." messages triggered by our very complicated cronjobs, but Maxwell is still running and replicating. Just... mind the gaps :)
Is it possible for you to sanitize and dump the full log somewhere?
I think maybe this is one bug manifesting in two different ways. The bug is that when maxwell's connection is lost in the middle of processing a GTID transaction, the GTID pointer has already marked that transaction as processed. We then erroneously attempt to reconnect (in the middle of the transaction processing code, also bad) past the end of the transaction. This would explain the "could not find next" errors as well as the gaps.
I'm working on a fix for this issue; it's quite hard to create a reliable test case/repro, but I'm close.
In the meantime, I'd ask: why, do you think, is your maxwell->mysql connection so unstable? I can see a connection dropped by a router here and there, but losing your CX multiple times an hour seems excessive....
@osheroff Yes it is, preferably over a private channel; give me a hint where we can exchange the data. Great news to hear anyway!
As for the connection problems, we've investigated issues related to Kubernetes connectivity multiple times. There is one related SNAT/DNAT issue with current Linux kernels which is a pain in the ass at connection time, but I don't think that is the case here?
I could try to run Maxwell elsewhere, avoiding k8s, VLANs and the various virtual routers in the path completely; maybe it is worth testing. We don't have any other clue. Our Cisco Nexus switches don't report any unusual numbers in the ethernet counters.
@jkrcma my email is ben dot osheroff at gmail dot com, hit me up...
I'd definitely think about bypassing the kubernetes bits. the only other thing I could think of is a rogue process-killer mysql server-side.
FYI some of this is going upstream... https://github.com/shyiko/mysql-binlog-connector-java/pull/250
@osheroff I'll send you the log output once Maxwell crashes again; I had to start from scratch yesterday evening and forgot to save the complete log output.
I've been running a separate Maxwell instance for a couple of days on a different node, in Docker within the same VLAN as our DB nodes, and so far there are no gaps present at all. So you were probably right, huh? :)
A process killer is present on our nodes, but trust me, it was one of the dark places I checked first.
hey @jkrcma any problems since you fixed the network config?
@osheroff Hi in the new year! Not sure if the packet drops are 100% fixed, but as far as I know the Maxwells were running without any issues during the Xmas holidays; then on January 1st one k8s node crashed and the routing became funky for some time. A lot of gaps were generated :]
ok 1.19.5 has fixes for the issues raised here. Thanks @jkrcma for sticking with it; you have no idea how many folks cruise by with an issue but don't stay to help debug. Let me know how it works for you!
And I did look yesterday to see whether the new version was out yet :) Actually I have an idea what it feels like, so I sympathize, and helping you was one of the goals of my work. I updated the chart and deployed all instances, purging the current states. So we will see, as the network issues in k8s are still present, probably due to the same Linux NAT issues.
In the meantime I found another small bug, will file the issue later :) Thank you so much for your work!
Looks like we are not done yet :(
One instance apparently handled the query event correctly, however the GTID gap is still being created:
12:05:42,075 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000002:1-52969974,... (sid:6379, cid:12747637)
12:05:42,075 INFO BinlogConnectorLifecycleListener - Binlog connected.
17:09:42,129 INFO BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
17:10:14,394 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
17:10:14,483 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-0000-0000-000000000002:1-52969974:53047682-55831299,... (sid:6379, cid:15173908)
On the other hand, the second instance (heavier throughput) still caught the same error as before:
16:50:23,904 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
16:50:23,995 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-17075273146,... (sid:6379, cid:832736)
16:50:23,995 INFO BinlogConnectorLifecycleListener - Binlog connected.
16:50:24,039 WARN BinlogConnectorReplicator - Unhandled QueryEvent @ mysql-bin.012109:279[00003306-1111-1111-1111-111111111111:1-17075273146,...] inside transaction: QueryEventData{threadId=2377788367, executionTime=917, errorCode=0, database='', sql='BEGIN'}
16:54:23,996 INFO BinaryLogClient - Trying to restore lost connection to 1.2.3.4:3306
16:54:34,380 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
16:54:34,470 INFO BinaryLogClient - Connected to 1.2.3.4:3306 at 00003306-1111-1111-1111-111111111111:1-17075273146:17075346085-17075346899,... (sid:6379, cid:833176)
16:54:34,471 INFO BinlogConnectorLifecycleListener - Binlog connected.
16:54:34,515 WARN BinlogConnectorReplicator - Unhandled QueryEvent @ mysql-bin.012110:279[00003306-1111-1111-1111-111111111111:1-17075273146:17075346085-17075346899,...] inside transaction: QueryEventData{threadId=2377832054, executionTime=900, errorCode=0, database='', sql='BEGIN'}
I copied the binlogs for later analysis tomorrow.
Looking at the trx 00003306-1111-1111-1111-111111111111:1-17075273146, again it is a big chunk of updates going from log pos 505638485 to 506392595, and trx ...147 is also big, pos 506392643 to 506412071. A pity the logfile position isn't printed in the logs, so we don't know where exactly the unhandled event occurred. I am afraid the fix you implemented didn't cover some edge cases.
Upon reconnecting Maxwell starts at the next binlog, creating the gap, which is, I believe, a safety condition to avoid unexpected behavior.
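(If it helps, the events around those positions can also be listed server-side; a sketch using the file names and offsets quoted above:)
SHOW BINLOG EVENTS IN 'mysql-bin.012108' FROM 505638485 LIMIT 30;  -- around the start of trx ...146
SHOW BINLOG EVENTS IN 'mysql-bin.012109' FROM 4 LIMIT 10;          -- the top of the next binlog file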
Interesting. There's a different pattern... here, the BinaryLogClient keepAlive thread is reconnecting us, and maxwell doesn't even really notice it's happening. When I was working on this code, I noticed we had to explicitly clear out the binlog filename and offset in order to reconnect properly (see https://github.com/zendesk/maxwell/pull/1186/files#diff-6c0364725d8f72f2d70f68c6c6115747R339).
Let's try disabling the native keepAlive thread, here's a quick release with the BinaryLogClient's keepAlive thread shut off: https://www.dropbox.com/s/1x2qko01n3obpqf/maxwell-1.19.5.1.tar.gz?dl=0. I think this will help out the UnhandledQueryEvent messages.
I'm less sure that this will fix the GTID gaps like the first client saw... was the connection message at 12:05:42,075 also part of a reconnect or was that the first connection that client had made?
Also I forgot to mention that the trx 17075273146 was actually found in binlog 012108, unlike what the log message says; 012109 begins with trx 17075346084.
The message was part of a reconnect. Our logs contain a lot of schema changes detected on a daily basis, so I didn't want to bother you with those and the other clueless reconnects which happened before.
I'm going to deploy the quick release and let you know!
yes. this is starting to actually make sense in my head. i think.
We're inside a large GTID transaction. the keep-alive thread detects a dead CX and reconnects us. But it doesn't do it in a safe, GTID-friendly way, it instead passes a full tuple of binlog filename, binlog position, and GTID. mysql sees this and utterly freaks out (why? I do not know) and starts replication at the next log file. This explains the gaps that start with the next file, as well as maxwell getting unexpected and odd events.
Benny the bookie lays 5 to 1 that turning off the keepalive thread fixes the whole ball of wax.
Ahh, that sounds logical indeed. Maxwell is deployed, so results should be noticeable within 24 hours. Thank you very much!
@osheroff It's working!!! I haven't seen a single Unhandled QueryEvent despite there being over 9 connection drops since I deployed it.
edit after 24 hours: Still not a single gap, nor the weird loops with missing binlogs and similar errors. I believe this fixed them as well.
fuckin' awesome. I've added this to mainline in v1.19.6, this also needs to go upstream to open-replicator. But it's fixed in maxwell for now. Thanks again for the dedication @jkrcma
We have quite a big Percona 5.6 setup running: GTID, a 4T database, over 1k transactions per second. Maxwell is connected to a slave node and replicates to a RabbitMQ cluster. Repeatedly I observe this behavior: once the unhandled QueryEvent warning is logged, the whole binlog which is being replicated at the moment is ignored, until the next binlog file is opened on the replication node. A big GTID gap counting tens of thousands of transactions is created, but Maxwell doesn't try to catch up on it at all. Once a critical number of gaps is present in the GTID set we are unable to get rid of them, as the replication node has already purged its binlogs.
I tried to dig more deeply into the binlogs in order to find out what exactly causes the unhandled QueryEvent, but I haven't found anything particularly useful.
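(For completeness, this is roughly how we check whether a gap can still be replayed from the node; a sketch, with a placeholder UUID and range standing in for the gap:)
SELECT @@GLOBAL.gtid_purged;
SELECT GTID_SUBSET('00000000-0000-0000-0000-000000000000:100-200',
                   @@GLOBAL.gtid_purged) AS gap_already_purged;  -- 1 means those binlog events are gone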