yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

How many records can be processed in one batch? #585

Open sopnopriyo opened 7 years ago

sopnopriyo commented 7 years ago

Mongo Connector is frequently failing in production when handling a large number of transactions in MongoDB. What is the best configuration for processing a large volume of operations in the database, and what is the maximum number of records mongo-connector can process in one batch?

sliwinski-milosz commented 7 years ago

Which doc manager do you use?

sopnopriyo commented 7 years ago

Hi @sliwinski-milosz ,

I have checked the version and it is 0.2

sliwinski-milosz commented 7 years ago

If you are talking about elastic2_doc_manager, you can try the newest version (dev version), which uses a bulk buffer for operations and improves performance. You can get it from the master branch: https://github.com/mongodb-labs/elastic2-doc-manager

Please note that the commit was added on 10 Nov. It has been reviewed very carefully, but so far it has been used by only a few people in dev, so there could be some hidden bugs.

But if your prod is failing anyway, it may be worth a try. You can install it with the command below:

pip install https://github.com/mongodb-labs/elastic2-doc-manager/archive/master.zip

Please remember to add autoCommitInterval as mentioned here: https://github.com/mongodb-labs/elastic2-doc-manager/issues/29
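
For reference, a minimal invocation sketch (the hosts below are placeholders; --auto-commit-interval is the command-line form of autoCommitInterval, as used later in this thread):

mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager --auto-commit-interval=1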

kanezch commented 7 years ago

@sliwinski-milosz Is there any recommended value for the auto-commit-interval parameter? I have installed master.zip following your instructions above. When I set auto-commit-interval=0, the sync from MongoDB to Elasticsearch is too slow, seemingly just 20~50 docs/s. When I set auto-commit-interval=1, everything is OK. I used the command below to write to MongoDB:

SHARDING_2:PRIMARY> for(var i=0;i<50000;i++){db.test.insert({name:"new3", age:9999})}
WriteResult({ "nInserted" : 1 })

When the writing finished (about 100 seconds later), the sync had finished too.

So why must auto-commit-interval be used? If I don't use it, or set it to zero, the result is the same as before (when I hadn't installed the version you fixed).

sliwinski-milosz commented 7 years ago

When autoCommitInterval is:

  1. None -> the bulk buffer is committed to ES only after it fills up. By default the bulk size is 1000 docs, but if there are no more updates for a while, the bulk buffer might hold some docs in memory for quite a long time, waiting to be filled.
  2. Equal to 0 -> every document is committed to ES immediately; there is no buffering/bulking (the old behaviour before the bulking logic was added).
  3. Equal to 1 -> the doc manager commits docs from the bulk buffer every 1 second, or whenever the buffer fills, and it uses a bulk request for the commit. That is why you noticed a performance boost.

Normally it should be fine to set autoCommitInterval to 2 and keep the default bulkSize.

In short, setting autoCommitInterval > 0 activates bulked commits to ES (at the time of writing, bulking works only in the dev version).
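
For reference, that recommendation in config-file form might look like the sketch below (targetURL is a placeholder; the format follows the config example later in this thread):

{
    "docManagers": [
        {
            "docManager": "elastic2_doc_manager",
            "targetURL": "localhost:9200",
            "autoCommitInterval": 2
        }
    ]
}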

kanezch commented 7 years ago

@sliwinski-milosz Very clear, thanks. But I have another problem today: if I try to delete a large number of documents, the sync is still too slow. I deleted all 100000 documents from MongoDB using db.test.remove({}), which finished in the blink of an eye, but the sync really took a long time (about a few minutes).

ShaneHarvey commented 7 years ago

Although sliwinski-milosz's change also batches up delete operations, it is expected that the replication will take longer than the original delete in MongoDB. The reason is that deleting 100000 documents via db.test.remove({}) creates 100000 corresponding delete entries in MongoDB's oplog. A more efficient way to delete an entire collection is to perform db.test.drop() because it results in a single drop oplog entry. For example:

replset:PRIMARY> db.test.insert({'_id':1})
WriteResult({ "nInserted" : 1 })
replset:PRIMARY> db.test.insert({'_id':2})
WriteResult({ "nInserted" : 1 })
replset:PRIMARY> db.test.insert({'_id':3})
WriteResult({ "nInserted" : 1 })
replset:PRIMARY> db.test.remove({})
WriteResult({ "nRemoved" : 3 })
replset:PRIMARY> db.getSiblingDB('local').oplog.rs.find({}, {ts:0, v:0, h:0, t:0})
...
{ "op" : "c", "ns" : "test.$cmd", "o" : { "create" : "test" } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 1 } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 2 } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 3 } }
{ "op" : "d", "ns" : "test.test", "o" : { "_id" : 1 } }
{ "op" : "d", "ns" : "test.test", "o" : { "_id" : 2 } }
{ "op" : "d", "ns" : "test.test", "o" : { "_id" : 3 } }
...

Notice how there are three insert ("i") operations followed by three delete ("d") operations. If we had issued a "drop" command, the result would look like:

replset:PRIMARY> db.test.drop()
true
replset:PRIMARY> db.getSiblingDB('local').oplog.rs.find({}, {ts:0, v:0, h:0, t:0})
...
{ "op" : "c", "ns" : "test.$cmd", "o" : { "create" : "test" } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 1 } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 2 } }
{ "op" : "i", "ns" : "test.test", "o" : { "_id" : 3 } }
{ "op" : "c", "ns" : "test.$cmd", "o" : { "drop" : "test" } }
...
sopnopriyo commented 7 years ago

Hi @sliwinski-milosz ,

Can you please give me a rough timeline for when you are planning to push the changes to a stable release?

Thanks, Shahin

sliwinski-milosz commented 7 years ago

@sopnopriyo That is a question for @ShaneHarvey ;-). I guess we need a couple more days, just to be sure that no new issues are reported by the people who decided to try it.

It would also be good to fix the service-script bug before the release, as it is very easy to reproduce that bug when bulking is on board and autoCommitInterval is set to None.

Btw, @sopnopriyo, have you tested the new version?

kanezch commented 7 years ago

@ShaneHarvey I used the drop() function today. The result is below:

  1. At the beginning, there were 908828 docs in estestdbnew.histclientverboses.

  2. The sync to ES had finished last night; the count in ES was the same as in MongoDB.

  3. Then I deleted the collection histclientverboses with: SHARDING_2:PRIMARY> db.histclientverboses.drop()

  4. After a few seconds, the count in ES became 883328.

  5. After about 6 minutes (timed with the stopwatch on my phone), there were still 146828 docs in ES.

  6. After waiting a while, the number finally became zero.

The mongo-connector command I used:

[root@master mongo-connector] mongo-connector --auto-commit-interval=3 -m mongodb://xxxx:xxxx@172.27.8.118:40000 -t 172.27.8.132:9200 -d elastic2_doc_manager -n estestdbnew.histclientverboses --batch-size 1000 --verbose
Logging to mongo-connector.log.
No handlers could be found for logger "elasticsearch.trace"
/usr/local/lib/python2.7/site-packages/elastic2_doc_manager-0.2.1.dev0-py2.7.egg/mongo_connector/doc_managers/elastic2_doc_manager.py:181: UserWarning: Deleting all documents of type histclientverboses on index estestdbnew. The mapping definition will persist and must be removed manually.
  "removed manually." % (coll, db))

Is there anything I can do to make it faster? In my production environment, 10000~30000 operations per second on MongoDB are expected, and I think the sync between Mongo and ES is a bit slow now, especially the deletion. Thank you!

ShaneHarvey commented 7 years ago

The drop still takes some time because Elasticsearch 2.0 removed the ability to delete the mapping for a type: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/indices-delete-mapping.html. To work around that limitation, the elastic2-doc-manager deletes the mapping with a streaming_bulk delete.
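
A rough sketch of that workaround, using elasticsearch-py's scan and streaming_bulk helpers (illustrative only, not the doc manager's actual code; the function name, host, and chunk size here are made up):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, streaming_bulk

es = Elasticsearch(["localhost:9200"])  # placeholder host

def delete_all_docs_of_type(index, doc_type):
    # Scroll over every document of the type, keeping only the IDs.
    actions = (
        {"_op_type": "delete", "_index": index, "_type": doc_type, "_id": hit["_id"]}
        for hit in scan(es, index=index, doc_type=doc_type, _source=False)
    )
    # streaming_bulk batches the delete actions into bulk requests.
    for ok, result in streaming_bulk(es, actions, chunk_size=1000):
        if not ok:
            print("delete failed: %r" % (result,))

Every document still has to be deleted individually on the Elasticsearch side, which is why the drop is instant in MongoDB but the sync still takes time.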

You may be able to speed up the sync by increasing the number of operations that the elastic2-doc-manager batches:

{
    "docManagers": [
        {
            "docManager": "elastic2_doc_manager",
            "targetURL": "localhost:9200",
            "bulkSize": 5000,
            "autoCommitInterval": 3
        }
    ]
}
sopnopriyo commented 7 years ago

Hi @sliwinski-milosz ,

Thanks for asking. I have started using the latest commit you mentioned. I will let you know how it goes. Thanks a lot for your support.

Regards, Shahin

sopnopriyo commented 7 years ago

Hi @sliwinski-milosz ,

As discussed earlier, I am using the latest commit you mentioned, but the problem now is that it does not catch changes in MongoDB anymore. Mongo-connector has been active for the last 6 days but has not detected any changes in the database.

Can you please look into it?

Thanks, Shahin

sliwinski-milosz commented 7 years ago

@sopnopriyo Please share your configuration

bfrggit commented 7 years ago

Calling db.<collection_name>.drop() directly will cause a warning in the connector, and the mongodb_meta index entries are not removed; they stay there forever unless you also delete the mongodb_meta index in ES. However, that does not work if I have more than one index syncing from MongoDB.
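
For reference, deleting that index manually might look like this (the host is a placeholder). Note that mongodb_meta holds metadata for everything being synced, which is why deleting it is not an option when more than one index is syncing from MongoDB:

curl -XDELETE 'http://localhost:9200/mongodb_meta'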