Open leonidez opened 8 years ago
Unfortunately, I'm not familiar with Heroku. Do you happen to know if mongo-connector is getting restarted? Can you post log files? If the oplog.timestamp file does not exist when mongo-connector (re)starts, it will dump all the specified collections and start tailing the oplog.
Is the MongoDB data being updated? The elastic-doc-managers reindex the entire document to replicate updates from MongoDB. That may be one reason why you're seeing many PUT requests on the same ids. What is the update/insert activity in MongoDB?
Also, can you calculate how fast mongo-connector is syncing data, either during the collection dump or while tailing the oplog?
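One rough way to get that number is to take two document counts from the target index a known interval apart and divide. The sketch below is only an illustration: `Sample`, `sync_rate`, and `eta_seconds` are hypothetical helper names, not part of mongo-connector, and the counts are assumed to come from whatever count query you already run against Elasticsearch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of the target index (hypothetical helper type)."""
    seconds: float   # when the count was taken, e.g. time.monotonic()
    doc_count: int   # documents in the target index at that moment


def sync_rate(first: Sample, second: Sample) -> float:
    """Return documents synced per second between two samples."""
    elapsed = second.seconds - first.seconds
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return (second.doc_count - first.doc_count) / elapsed


def eta_seconds(rate: float, remaining_docs: int) -> float:
    """Estimate seconds left; infinite if no forward progress is visible."""
    if rate <= 0:
        return float("inf")
    return remaining_docs / rate
```

Note that a document count only grows when new ids are indexed. If the connector is re-sending PUTs for ids that already exist, the count can stay flat while the connector is still busy, so a rate near zero here is itself a useful signal.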
@ShaneHarvey I've just heard back from Heroku, and as I expected, the filesystem is reset to a clean state every time they do a build or restart. You do have access to the FS while your dyno is running, which means mongo-connector can create the oplog.timestamp file, but the file won't persist across restarts.
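To confirm whether a given dyno start is going to trigger a full re-dump, a small check can be run just before mongo-connector launches. This is a minimal sketch based only on the behavior described in this thread (no oplog.timestamp file means a full collection dump); `startup_mode` is a hypothetical helper, and the path you pass is whatever location your setup uses for the progress file.

```python
import os


def startup_mode(progress_path: str) -> str:
    """Predict what mongo-connector will do at startup.

    Assumption taken from this thread: if the progress file is missing,
    the connector dumps all specified collections before tailing the oplog.
    """
    if os.path.isfile(progress_path):
        return "resume from oplog"
    return "full collection dump"


if __name__ == "__main__":
    # Log the prediction so it shows up alongside the connector's own logs.
    print("oplog.timestamp check:", startup_mode("oplog.timestamp"))
```

If this prints "full collection dump" on every restart, the repeated PUTs for existing ids are consistent with the collections being dumped again from scratch each time.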
To answer your questions:
Thank you for following up with me. I'll drop a note here when I have more data.
Hi @leonidez, did you have any success running mongo-connector on a Heroku dyno? I'm planning to do the same, so please do share. Thanks.
First, my setup:
Elasticsearch: set up through Bonsai on Heroku
Mongo: set up through mLab as a replica set
mongo-connector: set up in a Heroku app and initialized via Procfile
Using this setup I am able to get mongo-connector up and running in production. It currently has about 8 million documents (about 4 million of which are mongodb_meta items). However, even though I see a large amount of sync activity (and this has been going on for several days), most of the connector's activity seems to be PUT requests for items that are already in Elasticsearch. At this time it's about a third of the way done, but it is proceeding much more slowly than it did when the sync first started.
Before setting this up in production I ran a local test with a smaller set of data (about 10 percent) and was able to perform the sync and verify the results. I wonder if there could be an issue related to running the connector on Heroku (where I don't have total control over the oplog.timestamp). I have a ticket open on their end but wanted to find out what the word is here. Thanks.