Package and release Maxwell as a jar

xmlking commented 8 years ago

@osheroff just checking if there is any interest or concerns to wrap maxwell as NiFi processor.

Apache NiFi is plugin based system that lets developers to build their own custom processors and use them along with built-in processors , to compose dataflows using web based user interface.
I have been using Maxwell in MySQL --> Maxwell --> Kafka --> NiFi --> Hadoop(HDFS) dataflow.

Benefits:

Maxwell as a NiFi processor can be configured via web based tool.
Can be paused and restarted via web based interface.
Users have choice to build their own dataflows ( not just limited to console, file and kafka)
Maxwell as a NiFi processor can be re-pointed to new MySQL master, automatically. Maxwell processor can be configured to run on primary node in NiFi cluster. (automatically move Maxwell processer to second NiFi node when primary node is down).
Complex dataflows are easy to implement (e.g. filtering , transformation , aggregation, routing , storing to MongoDB or HBase etc)

osheroff commented 8 years ago

@xmlking, I don't directly have a personal interest in NiFi (I'm honestly waiting for the stream-processing-framework wars to shake out a bit before committing to any), but I'd be very interested in work that allowed us to package Maxwell as a library for external consumption, which would allow for a maxwell-nifi project/package that relies on maxwell-core -- ie the work of abstracting out things into reasonably generic interfaces enough so that it's easy to build maxwell into various frameworks.

@wushujames has, along similar lines, been working on something where he packages maxwell's brain into a kafka-connect library... mabybe there's some overlap there?

wushujames commented 8 years ago

@xmlking, when you say Apache Nifi is a "plugin based system", exactly what do you mean? Does Nifi just execute a subprocess and the subprocess can be written in any language? Or does Nifi load a plugin as a java class, for example? Just trying to find out where the boundary of the "plugin" is.

Yes, I'd like to eventually be able to package Maxwell as a library. The current implementation then would become an application that uses the maxwell-core library, and we can have a kafka-connect plugin that uses maxwell-core and an Nifi plugin that uses maxwell-core, etc.

I haven't made much progress on that front, unfortunately, but I have thought about it a bunch.

xmlking commented 8 years ago

@wushujames NiFi comes with 100+ built-in processes (bundled as .nar files in /lib directory). Users use canvas to drag and drop processers to visually build dataflows. We can build our own custom processes and drop it (e.g. nifi-maxwell.nar) in /lib directory to make it available in the canvas. For example, I created nifi-websocket processor wrapping Vert.X. NiFi API is pull based(iterator mode), I am thinking Maxwell can writer to LinkedBlockingQueue and NiFi can consume from that queue. The better option would be making maxwell-core iterable so that NiFi drive the data retrieval.

osheroff commented 8 years ago

@xmlking I'm hijacking your issue and turning this into "figure out how to make maxwell a library". Hope you don't mind.

wushujames commented 8 years ago

Sorry it took me long to respond about this. Here's what I think a Maxwell library would look like:

Ideally, it would actually look very similar to open-replicator or https://github.com/shyiko/mysql-binlog-connector-java, but instead of emitting raw MySQL binlog events, it would emit schema'd, strongly-typed Maxwell objects. It should NOT return the JSON-ified rows. It should return objects that have the field names, mysql-types (i.e. field type, field width, signed/unsigned), and values. This would give the maximum amount of information to the user of the library.
I think it'd be nice for there for be a single poll() interface. The poll() interface should provide the next event(s) from the binlog, OR the next event(s) from the bootstrapping process (if one is running). The user of the library would just poll() and not care whether they came from the bootstrapping process or the binlog. All current Producers would use this new poll() interface. I don't know what that means for the async bootstrapper. I think the async bootstrapper becomes no-longer possible.
A main part of the library is schema parsing and management and saving/reloading. I think that there should be an option to NOT store that data back in the originating mysql cluster, or require ANY mysql cluster for schema management. It would be kind of difficult, but it'd be nice to store the data in kafka itself, in a "mysql schemas" topic. The idea here is, when you create a new schema, you write it out to the topic. When you crash and restart, you reload the entire topic into memory and answer any schema queries from an in-memory store. This is similar to the Confluent schema registry: http://docs.confluent.io/2.0.0/schema-registry/docs/design.html
If possible or desired, you could even make the "schema registry" be an interface. The current implemention of Maxwell would implement a schema registry by storing things in mysql. A kafka connect, or other, implementation could store the schema registry in another type of data store. Yes, I realize that this means you essentially have TWO "library interfaces" to implement (one for Maxwell, one for the schema registry), so maybe that's too much work for not much gain.

As an aside: Conceptually, if you had the raw binlog events in a binlog "stream", and the schema registry events in a kafka "stream", then I think you are doing a "streaming join" of the two streams to compute schema'd binlog events.

xmlking commented 8 years ago

_Pull vs. Push API_

ReactiveX provides async API that is well suited for implementing streaming sources like Maxwell. with composable higher-order functions, end users can easily implement functionality such as blacklist, whitelist database , tables , retry connection etc.

By adopting RxJava or Reactive Streams API , developers can chose either pull or push behavior. Those SDKs also support back-pressure to auto-tune data flow velocity. I think Reactive Streams API will become part of Java 9 Flow API.

_Emit full Schema_
In one of my use case, I need full Schema info i.e., VARCHAR(50) VARCHAR(256) etc. wish we can capture and emit those field length information also in the Schema events

If we implement ReactiveX API, Maxwell will be exposing couple of Observables (rx_binlog, rx_bootstrap, rx_schema etc) that users can subscribe on main thread or other compute thread declaratively. Maxwell can connect/disconnect from MySQL based on subscribe & unsubscribe events.

osheroff commented 7 years ago

this was completed, maxwell-as-a-jar is up in the maven repo.

zendesk / maxwell

Package and release Maxwell as a jar #291