zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/
Other
4.04k stars 1.01k forks source link

Schema corruptions in Maxwell DB #1954

Open Shukla-Ankur opened 1 year ago

Shukla-Ankur commented 1 year ago

this issue is both for gathering information and also to raise the issues seen in a production environment. MySQL v8.0.23 Maxwell v1.33.0 I witnessed following data corruption in Maxwell :

Maxwell stopped with following error

7145 [ERROR] BinlogConnectorReplicator: Unable to cast 1592874639366037573 (java.lang.Long) into column sample_db.sample_table. tenant_id (type 'varchar')
7216 [ERROR] Maxwell: checking for schema inconsistencies in sample_db.sample_table
7270 [ERROR] Maxwell: `sample_db`.sample_table`.`tenant_id ` has a type mismatch, varchar vs bigint in new
7270 [ERROR] Maxwell: `sample_db`.sample_table`.`tenant_id `has a type mismatch, bigint vs varchar in old
12283 [WARN] MaxwellContext: Timed out waiting for heartbeat 1670390597826
  1. this maxwell instance has been running for months.
  2. sample_table had a field defined as tenant_id bigint
  3. This field was originally varchar but changed to bigint last month. There are no further logs in binlog for an alter on this field to make it varchar
  4. I could find the following insert into maxwell.schemas as last insert into maxwell.schemas for this table. If you look at tx timestamp, it matches with our internal timestamp of alter of tenant_id type to bigint.

Any ideas of how a entry from schemas table be deleted/corrupted, if not done manually? Since this tx exists in binlog, this implies that the schemas tables, in fact, did have this entry. However, there were no tx for delete against maxwell.schemas table.

Can you please highlight known scenarios of data corruption in the Maxwell tables?

# original_commit_timestamp=1667899427715626 (2022-11-08 09:23:47.715626 UTC)
### INSERT INTO \`maxwell\`.\`schemas\`
### SET
###   @1=8
###   @2='binlog.000006'
###   @3=52023246
###   @4=1667899460709
###   @5=NULL
###   @6=7
###   @7='[{"type":"table-alter","database":"sample_db","table":"sample_table","old":{"database":"sample_db","charset":"utf8mb4","table":"sample_table","columns":[{"type":"varchar","name":"id","charset":"utf8mb4"},{"type":"bigint","name":"created_time","signed":true},{"type":"bigint","name":"updated_time","signed":true},{"type":"varchar","name":"tenant_id","charset":"utf8mb4"},{"type":"varchar","name":"global_context_id","charset":"utf8mb4"},{"type":"varchar","name":"auth_id","charset":"utf8mb4"},{"type":"bigint","name":"version","signed":true},{"type":"bit","name":"f_cdc"},{"type":"varchar","name":"spotlight_id","charset":"utf8mb4"},{"type":"varchar","name":"name","charset":"utf8mb4"},{"type":"bigint","name":"company_id","signed":true},{"type":"int","name":"is_active","signed":true},{"type":"json","name":"banners"},{"type":"varchar","name":"created_by","charset":"utf8mb4"},{"type":"varchar","name":"updated_by","charset":"utf8mb4"},{"type":"varchar","name":"type","charset":"utf8mb4"},{"type":"enum","name":"state","enum-values":["NONE","DRAFT","PUBLISHED"]}],"primary-key":["id","tenant_id"]},"def":{"database":"sample_db","charset":"utf8mb4","table":"sample_table","columns":[{"type":"varchar","name":"id","charset":"utf8mb4"},{"type":"bigint","name":"created_time","signed":true},{"type":"bigint","name":"updated_time","signed":true},{"type":"bigint","name":"tenant_id","signed":true},{"type":"varchar","name":"global_context_id","charset":"utf8mb4"},{"type":"varchar","name":"auth_id","charset":"utf8mb4"},{"type":"bigint","name":"version","signed":true},{"type":"bit","name":"f_cdc"},{"type":"varchar","name":"spotlight_id","charset":"utf8mb4"},{"type":"varchar","name":"name","charset":"utf8mb4"},{"type":"bigint","name":"company_id","signed":true},{"type":"int","name":"is_active","signed":true},{"type":"json","name":"banners"},{"type":"varchar","name":"created_by","charset":"utf8mb4"},{"type":"varchar","name":"updated_by","charset":"utf8mb4"},{"type":"varchar","name":"type","charset":"utf8mb4"},{"type":"enum","name":"state","enum-values":["NONE","DRAFT","PUBLISHED"]}],"primary-key":["id","tenant_id"]}}]'
osheroff commented 1 year ago

changing the column name to sample_column isn't helping things -- in the binlog row you posted, which column corresponds to sample_column?

Does the row in maxwell.schemas row still exist?

Manually messing around with maxwell's position can lead to schema inconsistencies, if for instance you fail to process the change from one column type to another. I don't think that's what happened here.

When maxwell restarted did it hit the error condition again? I'm trying to figure out if the problem is with the schema data in mysql or in-memory.

Shukla-Ankur commented 1 year ago

Oh, that's my bad. It is the tenant_id column. I have removed 'sample_column' name.

osheroff commented 1 year ago

ok. So I still have the same questions:

It's very odd to me that maxwell processed the schema entry correctly and wrote it out to disk but somehow the in-memory version had an incorrect schema...

Shukla-Ankur commented 1 year ago
  1. No. I had to drop the maxwell DB and launch maxwell by giving init_position to rebuild the schema table in maxwell.
  2. Before dropping maxwell DB, there had been multiple restarts of the process. All of those failed with this error
  3. no. All CRUD operations have been running constantly against the table.
osheroff commented 1 year ago

do you have more maxwell logs from the month?

The only event that should cause a schema to change is processing an ALTER or other DDL statement. was there any indication of that happening around the time of the row update?

what was the binlog position of the row that caused the error? Can you dump that binlog event? Is there anything special or unusual about that row compared to other rows in the same table?

osheroff commented 1 year ago

BTW if this happens again please don't just DROP DATABASE maxwell -- rename it out of the way so you have a copy of the schema at the time -- it'd help debug this problem a lot.

Shukla-Ankur commented 1 year ago

let me gather and share all the maxwell logs I have. I still have the .ibd files for the DB. I'm not sure, but I'll try to restore it on another mysql instance. And yep, right mate, should have taken a backup at least or renamed it simply.

osheroff commented 1 year ago

it's like maxwell reset to an older version of the schema. could the binlog numbers have wrapped around or something crazy like that? At what position did the crash happen?

Shukla-Ankur commented 1 year ago
  1. Can't restore maxwell.
  2. But I do have binlog data. Issue was after position 172366339. Last binlog position record in maxwell was binlog.00007: 172366339
  3. As for your suggestion, I don't think that it wrapped back. The reason is that it would mean that maxwell would have wrapped back to a time before 8th Nov when the datatype of field was updated to bigint. If that were the case, the following binlogs would be old binlog but the issue occurred for a recent binlog position.
  4. Following are the some logs before the issue popped up.
#221207  5:19:48 server id 4  end_log_pos 172365598     Query   thread_id=667853    exec_time=0 error_code=0
SET TIMESTAMP=1670390388/*!*/;
BEGIN
/*!*/;
# at 172365598
#221207  5:19:48 server id 4  end_log_pos 172365671     Table_map: `maxwell`.`positions` mapped to number 203
# at 172365671
#221207  5:19:48 server id 4  end_log_pos 172365945     Update_rows: table id 203 flags: STMT_END_F
### UPDATE `maxwell`.`positions`
### WHERE
###   @1=0
###   @2='binlog.000007'
###   @3=172364491
###   @4='a772f477-e01f-4d39-b8b5-287d50885cf6:1-1786504'
###   @5='prod-cdc-rs-pageactivity-widgetlibrary-1'
###   @6=NULL
###   @7=1670390377818
### SET
###   @1=0
###   @2='binlog.000007'
###   @3=172365415
###   @4='a772f477-e01f-4d39-b8b5-287d50885cf6:1-1786506'
###   @5='prod-cdc-rs-pageactivity-widgetlibrary-1'
###   @6=NULL
###   @7=1670390387825
# at 172365945
#221207  5:19:48 server id 4  end_log_pos 172365972     Xid = 98718999
COMMIT/*!*/;
# at 172365972
#221207  5:19:57 server id 4  end_log_pos 172366054     GTID    last_committed=132262   sequence_number=132263  rbr_only=yes    original_committed_timestamp=1670390397814432   immediate_commit_timestamp=1670390397815582 transaction_length=394
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1670390397814432 (2022-12-07 05:19:57.814432 UTC)
# immediate_commit_timestamp=1670390397815582 (2022-12-07 05:19:57.815582 UTC)
/*!80001 SET @@session.original_commit_timestamp=1670390397814432*//*!*/;
/*!80014 SET @@session.original_server_version=80023*//*!*/;
/*!80014 SET @@session.immediate_server_version=80023*//*!*/;
SET @@SESSION.GTID_NEXT= 'a772f477-e01f-4d39-b8b5-287d50885cf6:1786509'/*!*/;
# at 172366054
#221207  5:19:57 server id 4  end_log_pos 172366137     Query   thread_id=667853    exec_time=0 error_code=0
SET TIMESTAMP=1670390397/*!*/;
BEGIN
/*!*/;
# at 172366137
#221207  5:19:57 server id 4  end_log_pos 172366199     Table_map: `maxwell`.`heartbeats` mapped to number 208
# at 172366199
#221207  5:19:57 server id 4  end_log_pos 172366339     Update_rows: table id 208 flags: STMT_END_F
### UPDATE `maxwell`.`heartbeats`
### WHERE
###   @1=0
###   @2='prod-cdc-rs-pageactivity-widgetlibrary-1'
###   @3=1670390387825
### SET
###   @1=0
###   @2='prod-cdc-rs-pageactivity-widgetlibrary-1'
###   @3=1670390397834
# at 172366339
osheroff commented 1 year ago

do you have binlogs for the update/insert maxwell actually crashed on? the one with tenant_id

Shukla-Ankur commented 1 year ago

172366339 is the last binlog position recoded by maxwell while in error state. Following is the latest binlog entry for the same table :

`# at 172366978

221207 5:20:07 server id 4 end_log_pos 172367073 Query thread_id=667652 exec_time=0 error_code=0

SET TIMESTAMP=1670390407/!/; BEGIN /!/;

at 172367073

221207 5:20:07 server id 4 end_log_pos 172367193 Table_map: `sample_db`.`sample_table` mapped to number 244

at 172367193

221207 5:20:07 server id 4 end_log_pos 172369633 Write_rows: table id 244 flags: STMT_END_F

INSERT INTO `rsample_db`.`sample_table`

SET

@1='1600359477158153088|0'

@2=1670390407

@3=1670390407

@4=1592874639366037573

@5='xxxxx'

@6='162140f2f9a02000'

@7=1

@8=b'1'

@9='1600359477158153088'

@10=''

@11=1594627152669674254

@12=1`

The column of concern is 4

osheroff commented 1 year ago

ok, now can you find a previous mention of sample_table in the same binlog file? i have a theory (not a great one) that it's something about the table_id or the table cache...

Shukla-Ankur commented 1 year ago

This is the last change for table I could find before error:

`

at 172187180

221207 5:02:50 server id 4 end_log_pos 172187300 Table_map: sample_db.sample_table mapped to number 244

at 172187300

221207 5:02:50 server id 4 end_log_pos 172191330 Update_rows: table id 244 flags: STMT_END_F

UPDATE sample_db.sample_table

WHERE

@1='1600344186914783683|0'

@2=1670389361

@3=1670389361

@4=1499128199550042568

@5='readiness-gql_2022-12-07T05:02:41.060Z_4985d98681b3aa5a'

@6='14f337c6f5102001'

@7=1

@8=b'1'

@9='1600344186914783683'

@10=''

@11=1527232201613817061

@12=1

@13=''

@14='1509611627753775105'

@15='1509611627753775105'

@16='RICH_TEXT'

@17=2

SET

@1='1600344186914783683|0'

@2=1670389361

@3=1670389370

@4=1499128199550042568

@5='readiness-gql_2022-12-07T05:02:49.987Z_ba25db0880a5fad8'

@6='14f337c6f5102001'

@7=2

@8=b'1'

@9='1600344186914783683'

@10=''

@11=1527232201613817061

@12=1

@13=''

@14='1509611627753775105'

@15='1509611627753775105'

@16='RICH_TEXT'

@17=2

at 172191330

221207 5:02:50 server id 4 end_log_pos 172191357 Xid = 98697502

COMMIT/!/; `

Shukla-Ankur commented 1 year ago

@osheroff this happened again. This time, this happened for multiple tables where we had updated the column type. Weird thing is that the alters were run on 9th Dec. We continued to publish CDC correctly after that till today. I could see updated entry in the schemas table of maxwell. I have kept the maxwell DB this time. Let me know what data you need for this.

osheroff commented 1 year ago

let's see what's in maxwell.columns for the table(s) in questions, as well as everything in maxwell.schemas.

Wondering aloud -- could schema compaction trigger this? Anything in the maxwell logs about "schema compaction"?

Shukla-Ankur commented 1 year ago

Latest error recorded 6335 [ERROR] BinlogConnectorReplicator: Unable to cast 1592752282412193923 (java.lang.Long) into column sample_db.module.tenant_id (type 'varchar') 6379 [ERROR] Maxwell: checking for schema inconsistencies in sample_db.module 6416 [ERROR] Maxwell: sample_db.module.tenant_id has a type mismatch, varchar vs bigint in new 6416 [ERROR] Maxwell: sample_db.module.tenant_id has a type mismatch, bigint vs varchar in old

For the DB in latest corruption, I see 2 entries in maxwell.databases table | 13 | 1 | sample_db | utf8mb4

| 37 | 1465 | sample_db | utf8mb4

entry in the maxwell.tables for the DB table for which maxwell failed | 519 | 1465 | 36 | module

select * from columns where table_id=519\G *************************** 1. row *************************** id: 7522 schema_id: 1465 table_id: 519 name: f_cdc charset: NULL coltype: bit is_signed: 0 enum_values: NULL column_length: NULL *************************** 2. row *************************** id: 7523 schema_id: 1465 table_id: 519 name: company_id charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 3. row *************************** id: 7524 schema_id: 1465 table_id: 519 name: module_id charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 4. row *************************** id: 7525 schema_id: 1465 table_id: 519 name: name charset: utf8mb4 coltype: varchar is_signed: 0 enum_values: NULL column_length: NULL *************************** 5. row *************************** id: 7526 schema_id: 1465 table_id: 519 name: module_version charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 6. row *************************** id: 7527 schema_id: 1465 table_id: 519 name: max_score charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 7. row *************************** id: 7528 schema_id: 1465 table_id: 519 name: state charset: NULL coltype: enum is_signed: 0 enum_values: ["COPYING_FILES"] column_length: NULL *************************** 8. row *************************** id: 7529 schema_id: 1465 table_id: 519 name: created_by charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 9. row *************************** id: 7530 schema_id: 1465 table_id: 519 name: module_type charset: NULL coltype: enum is_signed: 0 enum_values: ["TEST"] column_length: NULL *************************** 10. row *************************** id: 7531 schema_id: 1465 table_id: 519 name: id charset: utf8mb4 coltype: varchar is_signed: 0 enum_values: NULL column_length: NULL *************************** 11. row *************************** id: 7532 schema_id: 1465 table_id: 519 name: created_time charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 12. row *************************** id: 7533 schema_id: 1465 table_id: 519 name: updated_time charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL *************************** 13. row *************************** id: 7534 schema_id: 1465 table_id: 519 name: tenant_id charset: utf8mb4 coltype: varchar is_signed: 0 enum_values: NULL column_length: NULL *************************** 14. row *************************** id: 7535 schema_id: 1465 table_id: 519 name: auth_id charset: utf8mb4 coltype: varchar is_signed: 0 enum_values: NULL column_length: NULL *************************** 15. row *************************** id: 7536 schema_id: 1465 table_id: 519 name: global_context_id charset: utf8mb4 coltype: varchar is_signed: 0 enum_values: NULL column_length: NULL *************************** 16. row *************************** id: 7537 schema_id: 1465 table_id: 519 name: version charset: NULL coltype: bigint is_signed: 1 enum_values: NULL column_length: NULL 16 rows in set (0.00 sec) As can be seen here, as per maxwell, the datatype is varchar.

in maxwell.schemas, there are 26 entries for same table, all alters, but all have same binlog and position. So below is the entry with latest heartbeat. this shows datatype as bigint ` select * from maxwell.schemas | id | binlog_file | binlog_position | last_heartbeat_read | gtid_set | base_schema_id | deltas | server_id | position_sha | charset | version | deleted |

3266 | binlog.000129 | 89853139 | 1670582038626 | 06536c13-9fea-11ea-b1ec-0aea5c2a1d50:1-94880658,22d98918-6ad4-4f69-bdbf-1a01018f6550:1-192810486,23e289a8-1032-4d18-ae3b-774065c72da2:1-8133103,5841c312-0b74-11ea-949e-0240b5d03dd2:1-16003382,ff85281a-3a7d-11ed-8647-0a12595b8fce:1-2468 | 3254 | [{"type":"table-alter","database":"sample_db","table":"module","old":{"database":"sample_db","charset":"utf8mb4","table":"module","primary-key":["tenant_id","company_id","module_id"],"columns":[{"type":"varchar","name":"id","charset":"utf8mb4"},{"type":"bigint","name":"created_time","signed":true},{"type":"bigint","name":"updated_time","signed":true},{"type":"varchar","name":"tenant_id","charset":"utf8mb4"},{"type":"varchar","name":"global_context_id","charset":"utf8mb4"},{"type":"varchar","name":"auth_id","charset":"utf8mb4"},{"type":"bigint","name":"version","signed":true},{"type":"bit","name":"f_cdc"},{"type":"bigint","name":"company_id","signed":true},{"type":"bigint","name":"module_id","signed":true},{"type":"varchar","name":"name","charset":"utf8mb4"},{"type":"bigint","name":"module_version","signed":true},{"type":"bigint","name":"max_score","signed":true},{"type":"enum","name":"state","enum-values":["TEST"]},{"type":"bigint","name":"created_by","signed":true},{"type":"enum","name":"module_type","enum-values":["TEST"]}]},"def":{"database":"sample_db","charset":"utf8mb4","table":"module","primary-key":["tenant_id","company_id","module_id"],"columns":[{"type":"varchar","name":"id","charset":"utf8mb4"},{"type":"bigint","name":"created_time","signed":true},{"type":"bigint","name":"updated_time","signed":true},{"type":"bigint","name":"tenant_id","signed":true},{"type":"varchar","name":"global_context_id","charset":"utf8mb4"},{"type":"varchar","name":"auth_id","charset":"utf8mb4"},{"type":"bigint","name":"version","signed":true},{"type":"bit","name":"f_cdc"},{"type":"bigint","name":"company_id","signed":true},{"type":"bigint","name":"module_id","signed":true},{"type":"varchar","name":"name","charset":"utf8mb4"},{"type":"bigint","name":"module_version","signed":true},{"type":"bigint","name":"max_score","signed":true},{"type":"enum","name":"state","enum-values":["COPYING_FILES","COPYING_DB","DRAFT","PUBLISHED","ARCHIVED","DELETED","COPY_FAILED"]},{"type":"bigint","name":"created_by","signed":true},{"type":"enum","name":"module_type","enum-values":["TEST"]}]}}] | 2 | c66fea01ef29d6b2670bc17f7528fc30272992cd | utf8mb4 | 4 | 0 | `

Shukla-Ankur commented 1 year ago

@osheroff ^

Shukla-Ankur commented 1 year ago

@osheroff any suggestions here?

osheroff commented 1 year ago

sorry been busy @ day job.

I can't see anything from what you sent. I'm wondering if you could mysqldump your "maxwell" database and send it to me -- I'm ben dot osheroff at gmail dot com if you want to send me the file. Can provide a pgp key if you like.

Shukla-Ankur commented 1 year ago

will share after sanitizing the data