zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/

Couldn't find table X in database Y #2100

Closed jessedobbelaere closed 2 months ago

jessedobbelaere commented 3 months ago

We've been using Maxwell for years. It has been running very stably all that time, except for the memory issues caused by https://github.com/zendesk/maxwell/issues/2062, which were resolved by adding max_schemas=100 👌

We use AWS Aurora MySQL instances with 1000+ databases that share similar schemas (multi-tenancy). There has been no new infrastructure, no binlog or database change, and no setting change or version bump in Maxwell 🤔

This week we got alerted that our two Maxwell v1.41.0 containers started erroring with `java.lang.RuntimeException: Couldn't find table users in database db_XXX_40608`; the other instance reported a different database and table missing:

Logs of the Maxwell container:

```
2024-06-28 08:38:37 INFO Scripting - using function process_row from /etc/config/maxwell-filter.js
2024-06-28 08:38:37 INFO MaxwellMetrics - JMX metrics reporter enabled
2024-06-28 08:38:37 WARN MaxwellMetrics - JMX remote is disabled
2024-06-28 08:38:37 INFO MaxwellMetrics - Enabling UDP Datadog reporting with host 10.100.108.191, port 8125
2024-06-28 08:38:37 INFO MaxwellMetrics - Datadog reporting enabled
2024-06-28 08:38:40 INFO Maxwell - Starting Maxwell. maxMemory: 3113877504 bufferMemoryUsage: 0.25
2024-06-28 08:38:41 INFO ProducerConfig - ProducerConfig values:
    acks = 1
    batch.size = 16384
    bootstrap.servers = [b-1.XXX.c2.kafka.eu-central-1.amazonaws.com:9092, b-2.XXX.c2.kafka.eu-central-1.amazonaws.com:9092]
    buffer.memory = 33554432
    client.id =
    compression.type = none
    connections.max.idle.ms = 540000
    enable.idempotence = false
    interceptor.classes = null
    key.serializer = class org.apache.kafka.common.serialization.StringSerializer
    linger.ms = 0
    max.block.ms = 60000
    max.in.flight.requests.per.connection = 5
    max.request.size = 1048576
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
    receive.buffer.bytes = 32768
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retries = 0
    retry.backoff.ms = 100
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    send.buffer.bytes = 131072
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = null
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    transaction.timeout.ms = 60000
    transactional.id = null
    value.serializer = class org.apache.kafka.common.serialization.StringSerializer
2024-06-28 08:38:41 INFO AppInfoParser - Kafka version : 1.0.0
2024-06-28 08:38:41 INFO AppInfoParser - Kafka commitId : aaa7af6d4a11b29d
2024-06-28 08:38:41 INFO Maxwell - Maxwell v1.41.0 is booting (MaxwellKafkaProducer), starting at Position[BinlogPosition[mysql-bin-changelog.001213:59380184], lastHeartbeat=0]
2024-06-28 08:38:42 INFO MysqlSavedSchema - Restoring schema id 851 (last modified at Position[BinlogPosition[mysql-bin-changelog.001213:58812461], lastHeartbeat=0])
2024-06-28 08:40:01 INFO BinlogConnectorReplicator - Setting initial binlog pos to: mysql-bin-changelog.001213:59380184
2024-06-28 08:40:01 INFO MaxwellHTTPServer - Maxwell http server starting
2024-06-28 08:40:01 INFO MaxwellHTTPServer - Maxwell http server started on port 8080
2024-06-28 08:40:02 INFO HostnameChecker$1 - attempting to verify SSL identity 'XXX-mysql-staging-aurora-cluster.cluster-XXX.eu-central-1.rds.amazonaws.com' against cns: [XXX-mysql-XXX-1.czoujc6bolvx.eu-central-1.rds.amazonaws.com], subject-alts: [XXX-mysql-XXX-1.czoujc6bolvx.eu-central-1.rds.amazonaws.com,XXX-mysql-staging-aurora-cluster.cluster-XXX.eu-central-1.rds.amazonaws.com,XXX-mysql-staging-aurora-cluster.cluster-ro-XXX.eu-central-1.rds.amazonaws.com]
2024-06-28 08:40:02 INFO BinaryLogClient - SSL enabled
2024-06-28 08:40:02 INFO Server - jetty-10.0.14; built: 2023-02-22T23:12:32.272Z; git: 976721d0f3e903a243584d47870ad2f2c1bf9e55; jvm 11.0.16+8
2024-06-28 08:40:02 INFO BinaryLogClient - Connected to XXX-mysql-staging-aurora-cluster.cluster-XXX.eu-central-1.rds.amazonaws.com:3306 at mysql-bin-changelog.001213/59380184 (sid:6001, cid:949330)
2024-06-28 08:40:02 INFO BinlogConnectorReplicator - Binlog connected.
2024-06-28 08:40:02 INFO BinlogConnectorReplicator - Binlog disconnected.
2024-06-28 08:40:02 INFO TaskManager - Stopping 5 tasks
2024-06-28 08:40:02 ERROR TaskManager - cause:
java.lang.RuntimeException: Couldn't find table users in database db_XXX_40608
    at com.zendesk.maxwell.replication.TableCache.processEvent(TableCache.java:35) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getTransactionRows(BinlogConnectorReplicator.java:592) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getRow(BinlogConnectorReplicator.java:727) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.work(BinlogConnectorReplicator.java:235) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.util.RunLoopProcess.runLoop(RunLoopProcess.java:34) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.startInner(Maxwell.java:302) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.start(Maxwell.java:227) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.main(Maxwell.java:337) ~[maxwell-1.41.0.jar:1.41.0]
2024-06-28 08:40:02 INFO TaskManager - Stopping: com.zendesk.maxwell.schema.PositionStoreThread@18d7e2a7
2024-06-28 08:40:02 INFO TaskManager - Stopping: com.zendesk.maxwell.producer.MaxwellKafkaProducerWorker@ea5957d
2024-06-28 08:40:02 INFO KafkaProducer - [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
2024-06-28 08:40:02 INFO TaskManager - Stopping: com.zendesk.maxwell.bootstrap.BootstrapController@1dd3d26b
2024-06-28 08:40:02 INFO TaskManager - Stopping: com.zendesk.maxwell.replication.BinlogConnectorReplicator@7650e3d1
2024-06-28 08:40:02 INFO TaskManager - Stopping: com.zendesk.maxwell.monitoring.MaxwellHTTPServerWorker@5135e566
java.lang.RuntimeException: Couldn't find table users in database db_XXX_40608
    at com.zendesk.maxwell.replication.TableCache.processEvent(TableCache.java:35)
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getTransactionRows(BinlogConnectorReplicator.java:592)
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getRow(BinlogConnectorReplicator.java:727)
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.work(BinlogConnectorReplicator.java:235)
    at com.zendesk.maxwell.util.RunLoopProcess.runLoop(RunLoopProcess.java:34)
    at com.zendesk.maxwell.Maxwell.startInner(Maxwell.java:302)
    at com.zendesk.maxwell.Maxwell.start(Maxwell.java:227)
    at com.zendesk.maxwell.Maxwell.main(Maxwell.java:337)
2024-06-28 08:40:02 ERROR Maxwell - Maxwell saw an exception and is exiting...
java.lang.RuntimeException: Couldn't find table users in database db_XXX_40608
    at com.zendesk.maxwell.replication.TableCache.processEvent(TableCache.java:35) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getTransactionRows(BinlogConnectorReplicator.java:592) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getRow(BinlogConnectorReplicator.java:727) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.work(BinlogConnectorReplicator.java:235) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.util.RunLoopProcess.runLoop(RunLoopProcess.java:34) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.startInner(Maxwell.java:302) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.start(Maxwell.java:227) ~[maxwell-1.41.0.jar:1.41.0]
    at com.zendesk.maxwell.Maxwell.main(Maxwell.java:337) [maxwell-1.41.0.jar:1.41.0]
2024-06-28 08:40:02 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@45b7cac3{/,null,AVAILABLE}
2024-06-28 08:40:02 INFO AbstractConnector - Started ServerConnector@41607b4d{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2024-06-28 08:40:02 INFO Server - Started Server@2f1f319a{STARTING}[10.0.14,sto=0] @87310ms
2024-06-28 08:40:02 INFO Server - Stopped Server@2f1f319a{STOPPING}[10.0.14,sto=0]
2024-06-28 08:40:02 INFO AbstractConnector - Stopped ServerConnector@41607b4d{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2024-06-28 08:40:02 INFO ContextHandler - Stopped o.e.j.s.ServletContextHandler@45b7cac3{/,null,STOPPED}
2024-06-28 08:40:03 INFO TaskManager - Stopped all tasks
```
Maxwell config:

```
--kafka_version=1.0.0
--log_level=info
--host=$(MAXWELL_DB_HOST)
--user=$(MAXWELL_DB_USERNAME)
--password=$(MAXWELL_DB_PASSWORD)
--port=3306
--schema_database=maxwell_XXX_aurora_staging
--client_id=maxwell-XXX
--ssl=VERIFY_IDENTITY
--jdbc_options=zeroDateTimeBehavior=convertToNull&connectTimeout=5000&enabledTLSProtocols=TLSv1.2
--schema_ssl=VERIFY_IDENTITY
--schema_jdbc_options=zeroDateTimeBehavior=convertToNull&connectTimeout=5000&enabledTLSProtocols=TLSv1.2
--replica_server_id=6001
--replication_host=XXX-mysql-staging-aurora-cluster.cluster-XXX.eu-central-1.rds.amazonaws.com
--replication_user=$(MAXWELL_REPLICATION_USERNAME)
--replication_password=$(MAXWELL_REPLICATION_PASSWORD)
--replication_port=3306
--replication_ssl=VERIFY_IDENTITY
--replication_jdbc_options=zeroDateTimeBehavior=convertToNull&connectTimeout=5000&enabledTLSProtocols=TLSv1.2
--producer=kafka
--kafka.bootstrap.servers=XXX
--kafka_topic=maxwell-XXX-aurora
--producer_partition_by=database
--metrics_type=datadog,jmx,http
--metrics_prefix=maxwell
--metrics_datadog_type=udp
--metrics_datadog_tags=environment:staging,cluster:XXX-mysql-aurora
--metrics_datadog_interval=60
--metrics_datadog_host=$(DATADOG_HOST)
--metrics_datadog_port=8125
--http_diagnostic=true
'--filter=exclude: *.*, include: /db_XXX_.*$/.*'
--javascript=/etc/config/maxwell-filter.js
--max_schemas=100
```

We dropped the Maxwell database a few days ago as a quick fix. It recaptured the schema and worked fine, but after a day the same error happened again in one of the containers.

I checked, and the db_XXX_40608.users table does exist on the mysql host. The table is not new; it has existed since 2020. In a separate Maxwell container (different replica host, different binlogs), the same thing happened with another table.

I checked the Maxwell schema:

```sql
-- Check if the users table exists in Maxwell for the db that throws the error
SELECT t.*
FROM tables AS t
INNER JOIN `databases` AS d ON t.database_id = d.id
WHERE d.name = 'db_XXX_40608' AND t.name = 'users';
-- Returns no results. It does return results for other tables, and the users table is found in other similar databases.

-- Check if the users table exists in every database
SELECT DISTINCT d.name, t.*
FROM `databases` AS d
LEFT JOIN tables AS t ON t.database_id = d.id AND t.name = 'users'
WHERE d.name LIKE 'db_XXX_%' AND t.id IS NULL;
-- Shows 4 databases where the users table is missing in the Maxwell schema. It exists in the mysql source, and the databases have existed for months or years already.

-- Check if other tables are missing for some of the databases
SELECT DISTINCT d.name, t.*
FROM `databases` AS d
LEFT JOIN tables AS t ON t.database_id = d.id AND t.name = 'someothertable'
WHERE d.name LIKE 'db_XXX_%' AND t.id IS NULL;
-- Shows 1 database where the table is missing in the Maxwell schema. It exists in the mysql source.
```
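
For reference, a query along these lines should show which stored schema snapshots still have the table attached. This is only a rough sketch: it assumes the stock maxwell.sql bookkeeping tables (`schemas`, `databases`, `tables`) and the `base_schema_id`/`deleted` columns from that layout, and delta-based snapshots only store changes, so they won't carry full copies of every table.

```sql
-- For each stored schema snapshot, check whether db_XXX_40608 has a users table attached.
-- NULL/0 in has_users_table only proves absence for fully captured snapshots;
-- delta-based snapshots (base_schema_id IS NOT NULL) don't duplicate unchanged tables.
SELECT s.id AS schema_id,
       s.base_schema_id,
       s.deleted,
       MAX(t.name = 'users') AS has_users_table
FROM `schemas` AS s
LEFT JOIN `databases` AS d ON d.schema_id = s.id AND d.name = 'db_XXX_40608'
LEFT JOIN tables AS t ON t.database_id = d.id
GROUP BY s.id, s.base_schema_id, s.deleted
ORDER BY s.id DESC;
```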

Update: I upgraded from 1.41.0 to 1.41.1, temporarily removed --max_schemas=100 and added --recapture_schema, and that seems to have fixed the originally missing table db_XXX_40608.users. It's reading binlogs again, with no crash in the last 30 minutes ✅

Any idea on a root cause? Those specific databases and tables have existed for years, and we don't see that any recent schema changes happened, so there was likely no DDL in the binlogs. We already dropped the internal maxwell db two days ago when we first experienced these errors, so the schema should have been recaptured then, yet today db_XXX_40608.users was not found in Maxwell's tables table.
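
One more thing worth double-checking for the root cause: whether Maxwell ever recorded DDL touching that database in its stored schema deltas. A crude sketch, assuming the `deltas` JSON column from the stock maxwell.sql layout:

```sql
-- Scan the stored schema deltas for any recorded change mentioning the affected database.
-- A LIKE scan over the JSON blob is crude, but enough to see if DDL was ever captured for it.
SELECT id, base_schema_id, binlog_file, binlog_position
FROM `schemas`
WHERE deltas LIKE '%db\_XXX\_40608%'
ORDER BY id DESC;
```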

jessedobbelaere commented 3 months ago

Today we got hit by this one. This recent spike of issues is quite weird 🤔

```
2024-07-03 14:37:50 INFO  BinlogConnectorReplicator - Binlog connected.
2024-07-03 14:37:50 INFO  Server - jetty-10.0.14; built: 2023-02-22T23:12:32.272Z; git: 976721d0f3e903a243584d47870ad2f2c1bf9e55; jvm 11.0.16+8
2024-07-03 14:37:50 INFO  BinlogConnectorReplicator - Binlog disconnected.
2024-07-03 14:37:50 INFO  TaskManager - Stopping 5 tasks
2024-07-03 14:37:50 ERROR TaskManager - cause:
java.lang.RuntimeException: Couldn't find database db_39321
    at com.zendesk.maxwell.replication.TableCache.processEvent(TableCache.java:28) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getTransactionRows(BinlogConnectorReplicator.java:592) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.getRow(BinlogConnectorReplicator.java:727) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.replication.BinlogConnectorReplicator.work(BinlogConnectorReplicator.java:235) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.util.RunLoopProcess.runLoop(RunLoopProcess.java:34) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.Maxwell.startInner(Maxwell.java:302) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.Maxwell.start(Maxwell.java:227) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.Maxwell.main(Maxwell.java:337) ~[maxwell-1.41.1.jar:1.41.1]
```

I re-ran with --recapture_schema and noticed `2024-07-03 15:06:57 DEBUG SchemaCapturer - 685/1187 Capturing db_39321...` in the logs, but sadly it still crashes with the `java.lang.RuntimeException: Couldn't find database db_39321` error.

When I query `SELECT * FROM databases WHERE name = 'db_39321'` I can see:

| id    | schema_id | name     | charset |
|-------|-----------|----------|---------|
| 46389 |     13669 | db_39321 | utf8    |
| 47576 |     13669 | db_39321 | utf8    |
| 47911 |     13669 | db_39321 | utf8    |
| 48495 |     13925 | db_39321 | utf8    |
| 48691 |     13924 | db_39321 | utf8    |

But for other databases I see that they are also part of the schema with ID 13932, which is not the case for db_39321.

Could the --max_schemas=100 setting, which I added because of the memory issues, be related here?
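
To see what the compactor has left behind, something like this could help (a sketch; the `deleted` and `base_schema_id` columns are assumed from the stock maxwell.sql layout):

```sql
-- How many schema snapshots exist, and how many are flagged deleted by the compactor?
SELECT deleted, COUNT(*) AS schema_rows, MIN(id) AS oldest_id, MAX(id) AS newest_id
FROM `schemas`
GROUP BY deleted;

-- Is db_39321 attached to the most recent snapshots at all?
SELECT s.id AS schema_id,
       s.base_schema_id,
       EXISTS (SELECT 1 FROM `databases` AS d
               WHERE d.schema_id = s.id AND d.name = 'db_39321') AS has_db_39321
FROM `schemas` AS s
ORDER BY s.id DESC
LIMIT 10;
```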

osheroff commented 3 months ago

Ok, so there are two different issues here. Might be related, might not. Let's try to pick apart the first one first.

Do you have a sense of how frequently db_XXX_40608.users gets written to? I'm wondering how long the table was absent from the schema before the error went blam.

Your hunch that --max_schemas (i.e. the schema compaction code) could be at fault here at least passes a sniff test; do you have logs from when the last schema compaction happened in relation to these errors? It's possible that these tables or databases somehow failed to be written back when the schema was compacted.

are you running multiple maxwell clients that share the view of the schema? knowing that would help isolate down some variables.

jessedobbelaere commented 2 months ago

Hmm, it's staging, so there's not a huge load, although I can see some spikes in load (E2E tests perhaps). It happened on both of our Maxwell instances, which have different database clusters and different binlogs, which is definitely weird. However, they share the same maxwell internal db instance.

> are you running multiple maxwell clients that share the view of the schema? knowing that would help isolate down some variables.

We have two maxwell clients/containers that share the same RDS instance, but each has its own maxwell database to keep state in.

While inspecting the logs, I also noticed that we have memory issues on the maxwell mysql db, which runs on a db.t3.small AWS RDS Aurora instance (2 vCPUs, 2 GB memory). It only contains the two maxwell databases for our two instances, so it might be that we need a bigger instance. I saw this error more than 70 times between June 25 and June 28. Maybe it put the maxwell database into a bad state with all the restarts and mysql crashes.

```
2024-06-28 11:47:56 ERROR BootstrapController - got SQLException trying to bootstrap
java.sql.SQLException: Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129) ~[mysql-connector-java-8.0.28.jar:8.0.28]
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) ~[mysql-connector-java-8.0.28.jar:8.0.28]
    at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953) ~[mysql-connector-java-8.0.28.jar:8.0.28]
    at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:1009) ~[mysql-connector-java-8.0.28.jar:8.0.28]
    at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:1471) ~[c3p0-0.9.5.5.jar:0.9.5.5]
    at com.zendesk.maxwell.bootstrap.BootstrapController.getIncompleteTasks(BootstrapController.java:123) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.bootstrap.BootstrapController.doWork(BootstrapController.java:78) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.bootstrap.BootstrapController.work(BootstrapController.java:70) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.util.RunLoopProcess.runLoop(RunLoopProcess.java:34) ~[maxwell-1.41.1.jar:1.41.1]
    at com.zendesk.maxwell.MaxwellContext.lambda$startTask$0(MaxwellContext.java:350) ~[maxwell-1.41.1.jar:1.41.1]
    at java.lang.Thread.run(Thread.java:829) [?:?]
```

These mysql OOM errors also appear with `at com.zendesk.maxwell.schema.MysqlSchemaCompactor.work(MysqlSchemaCompactor.java:43) ~[maxwell-1.41.1.jar:1.41.1]` in the stack trace, so that might be the issue 😓
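
Before relying on the bigger instance alone, a quick look at how much memory mysqld is configured to use on the maxwell RDS instance can confirm the picture. A rough sketch using standard MySQL system variables (Aurora manages some of these itself, so treat the numbers as a sanity check rather than tuning advice):

```sql
-- Rough memory budget of the maxwell state instance.
SELECT
  @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb,
  @@max_connections                       AS max_connections,
  (@@sort_buffer_size + @@join_buffer_size +
   @@read_buffer_size + @@read_rnd_buffer_size) / 1024 AS per_connection_buffers_kb;
```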

I'll be upgrading to a bigger instance and keeping an eye on the logs 👀

jessedobbelaere commented 2 months ago

I upgraded the maxwell Aurora RDS to a bigger instance and will be monitoring it for the next few weeks. Closing this for now, as I think it's likely related to that 👍