yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Be fault tolerant : provide a multi-instance setup of mongo-connector #275

Open joe-mojo opened 9 years ago

joe-mojo commented 9 years ago

You run a replica set for fault tolerance. When mongo-connector is feeding an Elasticsearch cluster, you want that cluster kept up to date, so you cannot accept mongo-connector failing along with the mongod instance it is tailing.

To avoid that, you have to launch one mongo-connector per mongod instance. First issue: each mongo-connector reports the same update once that update has been replicated to the instance the connector is tailing. Fortunately for me, Elasticsearch is idempotent on document insert if the id field has been set correctly. Second issue: if the first instance legitimately does a full dump at start, the second mongo-connector instance, which has no oplog-ts, will start a dump too. Sadly, my Elasticsearch version is not idempotent on this bulk update and inserts duplicate docs!
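The difference between the two cases above can be illustrated with a minimal sketch. `ToyIndex` and `replay` are hypothetical names, and the dictionary is only an in-memory stand-in for a search index, not the Elasticsearch client API. The assumption is that a write carrying the source document's id overwrites in place, while a write without an id gets a fresh generated id each time.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for a search index (not the Elasticsearch API)."""

    def __init__(self):
        self.docs = {}   # stored documents, keyed by document id
        self._auto = 0   # counter used to generate ids when none is supplied

    def index(self, doc, doc_id=None):
        """Store doc under doc_id, or under a freshly generated id if none is given."""
        if doc_id is None:
            self._auto += 1
            doc_id = f"auto-{self._auto}"
        self.docs[doc_id] = doc
        return doc_id


def replay(index, entries, use_source_id):
    """Write every entry to the index, optionally reusing the source _id as the key."""
    for entry in entries:
        doc_id = str(entry["_id"]) if use_source_id else None
        body = {k: v for k, v in entry.items() if k != "_id"}
        index.index(body, doc_id)


entries = [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]

# Two connectors replay the same updates, keyed by the source id.
keyed = ToyIndex()
replay(keyed, entries, use_source_id=True)
replay(keyed, entries, use_source_id=True)

# Two connectors replay the same updates without a stable id.
unkeyed = ToyIndex()
replay(unkeyed, entries, use_source_id=False)
replay(unkeyed, entries, use_source_id=False)

print(len(keyed.docs), len(unkeyed.docs))
```

With a stable id the second replay is harmless; without one, every replay adds another copy of each document.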

It would be nice to have an option to use a db/collection reserved for storing oplog-ts. You could then run multiple mongo-connector instances per replica set without worrying that they will do the same work three times, wasting resources and eventually creating duplicates. In the meantime, you could update the docs to clearly indicate the best alternatives for running a multi-connector setup.
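A rough sketch of what such a shared checkpoint could look like, to make the idea concrete. This is not an existing mongo-connector option: `SharedCheckpointStore`, `claim_initial_dump`, `advance` and `read` are hypothetical names, and the in-memory dictionary guarded by a lock only stands in for the reserved db/collection. Timestamps are plain integers here for simplicity.

```python
import threading


class SharedCheckpointStore:
    """Hypothetical stand-in for a reserved collection holding one checkpoint per replica set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # replica set name -> {"owner": str, "oplog_ts": int or None}

    def claim_initial_dump(self, replica_set, connector_id):
        """Return True only for the first connector that asks; later callers skip the dump."""
        with self._lock:
            if replica_set in self._state:
                return False
            self._state[replica_set] = {"owner": connector_id, "oplog_ts": None}
            return True

    def advance(self, replica_set, oplog_ts):
        """Move the shared checkpoint forward; older timestamps are ignored."""
        with self._lock:
            entry = self._state[replica_set]
            if entry["oplog_ts"] is None or oplog_ts > entry["oplog_ts"]:
                entry["oplog_ts"] = oplog_ts

    def read(self, replica_set):
        """Return the last recorded timestamp, or None if nothing has been recorded yet."""
        with self._lock:
            entry = self._state.get(replica_set)
            return None if entry is None else entry["oplog_ts"]


store = SharedCheckpointStore()
first = store.claim_initial_dump("rs0", "connector-a")   # this connector does the dump
second = store.claim_initial_dump("rs0", "connector-b")  # this one skips it

store.advance("rs0", 100)
store.advance("rs0", 90)   # a lagging connector cannot move the checkpoint backwards
print(first, second, store.read("rs0"))
```

In an actual MongoDB collection the claim step could plausibly map to inserting a document whose `_id` is the replica set name, since a second insert with the same `_id` fails with a duplicate key error. That mapping is an assumption about one possible design, not something the project provides.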

One guy at MongoDB, Inc. is aware of this issue.

methuz commented 9 years ago

I want this feature too, and I'm planning to implement it soon.

FYI, I'm looking for a replacement for Elasticsearch River since it is deprecated and I want something more reliable (sometimes River cannot resume its progress; it just freezes). The candidates are mongo-connector, transporter (Go), and Mongoosastic (Node.js). None of them fits my requirements, so I have to add some code anyway, but mongo-connector seems to be the best choice since it already has the oplog timestamp; I only have to keep it in a db and work out how the instances cooperate. The problem is that I have Go and Node.js experience but have never used Python in a big project. I will have to decide soon because I need it in 2 months. Any suggestions?

llvtt commented 9 years ago

@methuz I'm glad you're willing to work on this feature. Once you think you've made as much progress as you can, feel free to submit a pull request, and I can review your code. I'm not very familiar with the other projects you mentioned (transporter and mongoosastic), so I can't make a suggestion as to which project or mongo-connector will be most suitable for your needs.

joe-mojo commented 9 years ago

Since mongo-connector can connect to a replica set with a MongoDB connection string, it shouldn't be affected by a mongod being down, but I don't know if it can handle the doc-manager URL being down: it doesn't connect to the Elasticsearch cluster but to a specific host.

So, to handle this case, mongo-connector should accept multiple target URLs for the doc manager and be aware of the cluster, so that it is not affected by a single instance failure.

I think this is part of the feature, but you could split it in two: mongo-connector failure/network partition as one feature, and doc_manager cluster support as another.

MasterMind2k commented 8 years ago

Has any progress been made on this issue?

ShaneHarvey commented 7 years ago

mongo-connector does its best to be resilient against transient connection failures to the source and target(s), but if mongo-connector or the machine itself fails, there's no mechanism for another machine running mongo-connector to take over automatically. At this time there are no plans to add one.

ShaneHarvey commented 7 years ago

@joe-mojo @MasterMind2k starting in version 0.3.0 the Elasticsearch doc managers can connect to an Elasticsearch cluster with multiple hosts, see https://github.com/mongodb-labs/mongo-connector/wiki/Usage-with-Elasticsearch#connecting-to-multiple-elasticsearch-hosts.

joe-mojo commented 7 years ago

Nice. But the main issues (as discussed earlier) remain.

Joachim

On 24 Jan 2017, at 21:31, Shane Harvey notifications@github.com wrote:

@joe-mojo @MasterMind2k starting in version 0.3.0 the Elasticsearch doc managers can connect to an Elasticsearch cluster with multiple hosts, see https://github.com/mongodb-labs/mongo-connector/wiki/Usage-with-Elasticsearch#connecting-to-multiple-elasticsearch-hosts.


ulucsahinalp commented 7 years ago

There is a mechanism for another machine running mongo-connector to take over automatically: using ZooKeeper and dockerizing mongo-connector makes it easy to build a multi-node, fault-tolerant mongo-connector architecture...

jamesjjk commented 7 years ago

@ulucsahinalp Could you expand on the failover setup you are describing? It would be great to create a readily available Docker image for this.
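For readers wondering what "automatic takeover" means in practice, here is a minimal sketch of the general idea, assuming a lease with a time-to-live that the active connector must keep renewing. It is not the setup described above and does not use the ZooKeeper API: `LeaseElection` and `heartbeat` are hypothetical names, the lease lives in memory, and time is passed in explicitly so the behaviour is easy to follow.

```python
class LeaseElection:
    """Hypothetical in-memory lease; a coordination service would hold this state in practice."""

    def __init__(self, ttl):
        self.ttl = ttl          # seconds a lease stays valid without renewal
        self.holder = None      # id of the connector currently allowed to run
        self.expires_at = 0.0   # time at which the current lease lapses

    def heartbeat(self, node_id, now):
        """Acquire or renew the lease; return True if node_id is the active connector."""
        if self.holder == node_id or now >= self.expires_at:
            self.holder = node_id
            self.expires_at = now + self.ttl
            return True
        return False


lease = LeaseElection(ttl=10.0)

a0 = lease.heartbeat("connector-a", now=0.0)    # connector-a becomes active
b5 = lease.heartbeat("connector-b", now=5.0)    # standby: lease still held by connector-a
a8 = lease.heartbeat("connector-a", now=8.0)    # connector-a renews until t=18
b12 = lease.heartbeat("connector-b", now=12.0)  # still standby
# connector-a crashes here and stops sending heartbeats
b20 = lease.heartbeat("connector-b", now=20.0)  # lease lapsed at t=18, connector-b takes over

print(a0, b5, a8, b12, b20, lease.holder)
```

Only the connector holding the lease would tail the oplog; the standby keeps calling `heartbeat` and starts working once the call returns True. A takeover like this still needs a shared oplog timestamp so that the new active instance resumes where the failed one stopped.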