zendesk / maxwell

Maxwell's daemon, a mysql-to-json kafka producer
https://maxwells-daemon.io/
Other
4.02k stars 1.01k forks source link

Package and release Maxwell as a jar #291

Closed xmlking closed 7 years ago

xmlking commented 8 years ago

@osheroff just checking if there is any interest or concerns to wrap maxwell as NiFi processor.

Apache NiFi is plugin based system that lets developers to build their own custom processors and use them along with built-in processors , to compose dataflows using web based user interface.
I have been using Maxwell in MySQL --> Maxwell --> Kafka --> NiFi --> Hadoop(HDFS) dataflow.

Benefits:

  1. Maxwell as a NiFi processor can be configured via web based tool.
  2. Can be paused and restarted via web based interface.
  3. Users have choice to build their own dataflows ( not just limited to console, file and kafka)
  4. Maxwell as a NiFi processor can be re-pointed to new MySQL master, automatically. Maxwell processor can be configured to run on primary node in NiFi cluster. (automatically move Maxwell processer to second NiFi node when primary node is down).
  5. Complex dataflows are easy to implement (e.g. filtering , transformation , aggregation, routing , storing to MongoDB or HBase etc)
osheroff commented 8 years ago

@xmlking, I don't directly have a personal interest in NiFi (I'm honestly waiting for the stream-processing-framework wars to shake out a bit before committing to any), but I'd be very interested in work that allowed us to package Maxwell as a library for external consumption, which would allow for a maxwell-nifi project/package that relies on maxwell-core -- ie the work of abstracting out things into reasonably generic interfaces enough so that it's easy to build maxwell into various frameworks.

@wushujames has, along similar lines, been working on something where he packages maxwell's brain into a kafka-connect library... mabybe there's some overlap there?

wushujames commented 8 years ago

@xmlking, when you say Apache Nifi is a "plugin based system", exactly what do you mean? Does Nifi just execute a subprocess and the subprocess can be written in any language? Or does Nifi load a plugin as a java class, for example? Just trying to find out where the boundary of the "plugin" is.

Yes, I'd like to eventually be able to package Maxwell as a library. The current implementation then would become an application that uses the maxwell-core library, and we can have a kafka-connect plugin that uses maxwell-core and an Nifi plugin that uses maxwell-core, etc.

I haven't made much progress on that front, unfortunately, but I have thought about it a bunch.

xmlking commented 8 years ago

@wushujames NiFi comes with 100+ built-in processes (bundled as .nar files in /lib directory). Users use canvas to drag and drop processers to visually build dataflows. We can build our own custom processes and drop it (e.g. nifi-maxwell.nar) in /lib directory to make it available in the canvas. For example, I created nifi-websocket processor wrapping Vert.X. NiFi API is pull based(iterator mode), I am thinking Maxwell can writer to LinkedBlockingQueue and NiFi can consume from that queue. The better option would be making maxwell-core iterable so that NiFi drive the data retrieval.

osheroff commented 8 years ago

@xmlking I'm hijacking your issue and turning this into "figure out how to make maxwell a library". Hope you don't mind.

wushujames commented 8 years ago

Sorry it took me long to respond about this. Here's what I think a Maxwell library would look like:

As an aside: Conceptually, if you had the raw binlog events in a binlog "stream", and the schema registry events in a kafka "stream", then I think you are doing a "streaming join" of the two streams to compute schema'd binlog events.

xmlking commented 8 years ago

_Pull vs. Push API_

ReactiveX provides async API that is well suited for implementing streaming sources like Maxwell. with composable higher-order functions, end users can easily implement functionality such as blacklist, whitelist database , tables , retry connection etc.

By adopting RxJava or Reactive Streams API , developers can chose either pull or push behavior. Those SDKs also support back-pressure to auto-tune data flow velocity. I think Reactive Streams API will become part of Java 9 Flow API.

_Emit full Schema_
In one of my use case, I need full Schema info i.e., VARCHAR(50) VARCHAR(256) etc. wish we can capture and emit those field length information also in the Schema events

If we implement ReactiveX API, Maxwell will be exposing couple of Observables (rx_binlog, rx_bootstrap, rx_schema etc) that users can subscribe on main thread or other compute thread declaratively. Maxwell can connect/disconnect from MySQL based on subscribe & unsubscribe events.

osheroff commented 7 years ago

this was completed, maxwell-as-a-jar is up in the maven repo.