zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer

Connect to Vitess #1757

Open Stormhand opened 2 years ago

Stormhand commented 2 years ago

Is it possible to read CDC from Vitess? I am not sure how Maxwell will read the MySQL binlogs behind it and store their files and GTIDs in the meta db.

kovyrin commented 1 year ago

I'm currently investigating the same issue and it looks like the recommended way to consume CDC events from Vitess is through VStream. That is what Debezium uses as well.

Unfortunately, AFAIU, VStream is a separate gRPC API and is not directly compatible with MySQL protocol, so Maxwell would need to be modified to implement a completely separate replicator (similar to the existing BinlogConnectorReplicator, but based on a gRPC client for the VStream API in vtgate).

kovyrin commented 1 year ago

@osheroff I have an implementation (currently very much an MVP) of a VStream connector that generates a stream of RowMap objects, which I'm using to experiment with running my custom Maxwell producer on top of Vitess. It works pretty well in my testing and we're planning to start larger scale tests of that setup soon.

Would you be interested in us adding basic support for Vitess into Maxwell, or is that too far out of scope for the product? The semantics are very similar to binlogs, but they may be different enough, so I am not sure if I should try bending Maxwell to support Vitess or just move towards a fully custom solution.

Thank you for building such a great product!

osheroff commented 1 year ago

open up a WIP! I'd love to see it.

very happy to stretch Maxwell out a bit. postgres has long been on my wishlist too.

Obviously all the schema-storage-fu is all different. How does vitess do positions?

kovyrin commented 1 year ago

> open up a WIP! I'd love to see it.

Cool, will try it.

> very happy to stretch Maxwell out a bit. postgres has long been on my wishlist too.

Oh, nice, then it totally makes sense to do something with Vitess.

> Obviously all the schema-storage-fu is all different.

It is actually really nice in VStream: they send you the schema within the stream before an event that uses that schema. Like, before getting a change event for a table X, you get a full schema for that table with all the data you need. So, at least for the stuff I've been doing, I have not encountered any need to go to MySQL for any schema-related stuff.
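To make the "schema arrives before the rows" idea concrete, here is a minimal sketch in Java of how a reader could cache the column list from a schema (FIELD) event and use it to decode later row events. The class and method names (`VStreamSchemaCache`, `onFieldEvent`, `onRowEvent`) are hypothetical stand-ins for illustration only; they are not part of Maxwell or the Vitess client, and real VStream events carry more detail (column types, before/after images) than this shows.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Maxwell or Vitess API.
class VStreamSchemaCache {
    // Table name -> ordered column names from the most recent schema event.
    private final Map<String, List<String>> columnsByTable = new HashMap<>();

    // Called when a schema (FIELD) event arrives: remember or replace the column list.
    void onFieldEvent(String table, List<String> columnNames) {
        columnsByTable.put(table, List.copyOf(columnNames));
    }

    // Called when a row event arrives: pair positional values with cached column names.
    Map<String, Object> onRowEvent(String table, List<Object> values) {
        List<String> columns = columnsByTable.get(table);
        if (columns == null) {
            throw new IllegalStateException(
                "row event for " + table + " arrived before its schema event");
        }
        if (columns.size() != values.size()) {
            throw new IllegalStateException(
                "expected " + columns.size() + " values for " + table
                    + " but got " + values.size());
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            row.put(columns.get(i), values.get(i));
        }
        return row;
    }
}
```

With something like this, the reader never has to query MySQL for table definitions: a later schema event for the same table simply overwrites the cached entry.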

> How does vitess do positions?

Pretty cool as well: they send you VGTID events with each transaction, which is a complex data structure that describes stream position across all the shards you're following (each shard is a separate MySQL cluster, so the VGTID has a GTID value for that specific cluster). Here is a simple example from my local dev environment:

    {
      "type": "VGTID",
      "vgtid": {
        "shardGtids": [
          {
            "keyspace": "commerce",
            "shard": "0",
            "gtid": "MySQL56/3a37ee54-5705-11ed-a848-db61527a6db3:1-1664"
          }
        ]
      },
      "keyspace": "commerce",
      "shard": "0"
    }
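
For illustration, a minimal Java sketch of how a position like this could be tracked per shard and stored as a string so a reader can resume later. Everything here is an assumption made for the example: `VStreamPosition`, its methods, and the tab-separated storage format are hypothetical, and are not how Maxwell or the linked PR actually store positions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a resumable multi-shard position: not Maxwell or Vitess API.
class VStreamPosition {
    // "keyspace<TAB>shard" -> latest GTID value seen for that shard.
    private final Map<String, String> gtidByShard = new TreeMap<>();

    // Apply one shardGtids entry from a VGTID event.
    void update(String keyspace, String shard, String gtid) {
        gtidByShard.put(keyspace + "\t" + shard, gtid);
    }

    String gtidFor(String keyspace, String shard) {
        return gtidByShard.get(keyspace + "\t" + shard);
    }

    // One line per shard: keyspace, shard, gtid separated by tabs (assumed format).
    String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : gtidByShard.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Rebuild a position from the string produced by serialize().
    static VStreamPosition parse(String stored) {
        VStreamPosition position = new VStreamPosition();
        for (String line : stored.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("malformed position line: " + line);
            }
            position.update(parts[0], parts[1], parts[2]);
        }
        return position;
    }
}
```

The point of keying by keyspace and shard is that each shard advances independently, so a stored position has to carry one GTID value per shard rather than a single binlog file and offset.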

kovyrin commented 1 year ago

Status update: I've spent the past couple of days understanding how Maxwell connects to a binlog and consumes binlog events, and then adding a VStream-based replicator designed as a drop-in replacement for MysqlReplicator. The result has been pushed to https://github.com/zendesk/maxwell/pull/1943, as I'm going to be away on vacation for the next week and wanted to show the current state to start the conversation.