yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Running mongo-connector in production #581

Open leonidez opened 8 years ago

leonidez commented 8 years ago

First, my setup:

Elasticsearch: set up through Bonsai on Heroku
Mongo: set up through mLab as a replica set
mongo-connector: set up in a Heroku app and initialized via Procfile

Using this setup, I am able to get mongo-connector up and running in production. It currently has about 8 million documents (about 4 million of which are mongodb_meta items). However, even though I see a large amount of sync activity (and this has been going on for several days), most of the connector's activity seems to be PUT requests for items that are already in Elasticsearch. At this time it's about a third of the way done, but it is proceeding much more slowly than it did when the sync first started.

Before setting this up in production I ran a local test with a smaller set of data (about 10 percent) and was able to perform the sync and verify the results. I wonder if there could be an issue related to running the connector on Heroku (where I don't have total control over the oplog.timestamp). I have a ticket open on their end but wanted to find out what the word is here. Thanks.

ShaneHarvey commented 8 years ago

Unfortunately, I'm not familiar with Heroku. Do you happen to know if mongo-connector is getting restarted? Can you post log files? If the oplog.timestamp file does not exist when mongo-connector (re)starts, it will dump all the specified collections and start tailing the oplog.
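The restart rule described above can be pictured with a small sketch. This is not mongo-connector's actual code: `startup_mode` and the two strings it returns are hypothetical names used only to illustrate the decision, and the file path is an assumption.

```python
import os


def startup_mode(progress_file):
    """Return the (hypothetical) mode a restart would fall into.

    Illustrative only: mirrors the rule that a missing oplog.timestamp
    file leads to a full collection dump before tailing the oplog.
    """
    if os.path.exists(progress_file):
        # A saved position exists, so syncing can resume from it.
        return "resume_from_saved_timestamp"
    # No saved position: dump every configured collection, then tail.
    return "full_collection_dump"


if __name__ == "__main__":
    print(startup_mode("oplog.timestamp"))
```

If the process is restarted on a host where the file has disappeared, a check like this would land in the full-dump branch every time, which would look like repeated writes for documents that are already indexed.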

Is the MongoDB data being updated? The elastic-doc-managers reindex the entire document to replicate updates from MongoDB. That may be one reason why you're seeing many PUT requests on the same ids. What is the update/insert activity in MongoDB?

Also, can you calculate how fast mongo-connector is syncing data? Either during the collection dump, or when tailing the oplog.
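One rough way to get such a number is to sample the target document count twice and divide by the elapsed time. The sketch below assumes the counts are collected by some other means; `sync_rate` and `eta_seconds` are hypothetical helper names, and the sample figures are made up rather than taken from this thread.

```python
def sync_rate(count_start, count_end, elapsed_seconds):
    """Documents synced per second between two count samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_end - count_start) / elapsed_seconds


def eta_seconds(remaining_docs, rate):
    """Estimated seconds left at the given rate; None if stalled."""
    if rate <= 0:
        return None
    return remaining_docs / rate


if __name__ == "__main__":
    # Hypothetical samples taken 10 minutes apart (not data from this thread).
    rate = sync_rate(1_000_000, 1_030_000, 600)
    print(rate)                        # 50.0
    print(eta_seconds(500_000, rate))  # 10000.0
```

Because updates reindex documents that already exist, a count-based rate can understate the real write activity; it is probably most meaningful during the initial collection dump.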

leonidez commented 8 years ago

@ShaneHarvey I've just heard back from Heroku, and as I expected, the filesystem is wiped clean every time they do a build or restart. You do have access to the filesystem while your dyno is running, which means mongo-connector can create the oplog.timestamp file; it just won't persist.
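One possible workaround, sketched here purely as an assumption and not something confirmed in this thread, is to copy the progress file to durable storage while the dyno runs and write it back before the connector starts. `store` stands for any dict-like durable store; it and both helper names are hypothetical.

```python
import os


def restore_progress(store, key, progress_file):
    """Write the saved progress back to disk before the connector starts.

    `store` is any dict-like durable store (hypothetical). Returns True
    if a saved copy was found and restored.
    """
    saved = store.get(key)
    if saved is None:
        return False
    with open(progress_file, "w") as fh:
        fh.write(saved)
    return True


def backup_progress(store, key, progress_file):
    """Copy the current progress file into the durable store, if present."""
    if not os.path.exists(progress_file):
        return False
    with open(progress_file) as fh:
        store[key] = fh.read()
    return True
```

The backup would have to run periodically, and a copy taken while the file is being written could be incomplete, so treat this as a starting point rather than a tested setup.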

To answer your questions:

Thank you for following up with me. I'll drop a note here when I have more data.

harlandjp commented 6 years ago

Hi @leonidez, did you have any success running mongo-connector on a Heroku dyno? I'm planning to do the same, so please do share. Thanks.