yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Failed during dump collection cannot recover #552

Open · JPacks opened this issue 7 years ago

JPacks commented 7 years ago

I am trying to sync a MongoDB replica set to Elasticsearch using mongo-connector. It works fine when I insert the first doc into my collection "check", but I get a "Failed during dump collection cannot recover" error in mongo-connector.log on the second doc insertion. Due to this error, the second doc is not getting loaded into the Elasticsearch index.

The commands I used are:

To start the Mongo replica set:
sudo mongod --port 27017 --dbpath /_/_/ --replSet rs0

To start Mongo Connector:
mongo-connector -m localhost:27017 -t localhost:9200 -d elastic_doc_manager --auto-commit-interval=0 -n a.check

Mongo-connector.log:

2016-10-13 17:27:45,381 [CRITICAL] mongo_connector.oplog_manager:630 - Exception during collection dump
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 583, in do_dump
    upsert_all(dm)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 567, in upsert_all
    dm.bulk_upsert(docs_to_dump(namespace), mapped_ns, long_ts)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 43, in wrapped
    reraise(new_type, exc_value, exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 32, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic_doc_manager.py", line 214, in bulk_upsert
    for ok, resp in responses:
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 89, in _process_bulk_chunk
    raise e
ConnectionFailed: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))
2016-10-13 17:27:45,381 [ERROR] mongo_connector.oplog_manager:638 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, replicaset=u'rs0'), u'local'), u'oplog.rs')
2016-10-13 17:27:46,376 [ERROR] mongo_connector.connector:304 - MongoConnector: OplogThread <OplogThread(Thread-2, started 140648179619584)> unexpectedly stopped! Shutting down

FYI, I am using Elasticsearch 2.3.1, MongoDB 3.0.12, and mongo-connector 2.4.1.

ShaneHarvey commented 7 years ago

Looks like you are hitting a ReadTimeoutError on Elastic. Try increasing the timeout using a config file such as:

{
  "mainAddress": "localhost:27017",
  "verbosity": 3,
  "namespaces": {
    "include": ["a.check"]
  },
  "docManagers": [
    {
      "docManager": "elastic_doc_manager",
      "targetURL": "localhost:9200",
      "autoCommitInterval": 0,
      "args": {
        "clientOptions": {"timeout": 30}
      }
    }
  ]
}
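
For reference, the clientOptions dict is forwarded as keyword arguments to the elasticsearch-py client, so the setting above should behave roughly like this sketch (this is my reading of the doc manager's behavior, not its actual code; the constructor call itself is standard elasticsearch-py):

from elasticsearch import Elasticsearch

# "clientOptions" entries become keyword arguments on the elasticsearch-py
# client, so {"timeout": 30} is roughly equivalent to constructing:
es = Elasticsearch(["localhost:9200"], timeout=30)  # read timeout in seconds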

You can also use the continueOnError option to force mongo-connector to log and ignore errors during the collection dump, as shown below.
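
For example, as a top-level option in the same style of config file (spelling as in mongo-connector's example config; treat this as a sketch, not a complete config):

{
  "mainAddress": "localhost:27017",
  "continueOnError": true,
  "namespaces": {
    "include": ["a.check"]
  },
  "docManagers": [
    {
      "docManager": "elastic_doc_manager",
      "targetURL": "localhost:9200"
    }
  ]
}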

mumlax commented 7 years ago

I'm also suddenly running into this error when doing a resync. It had worked for a long time.

2017-01-19 12:43:52,690 [CRITICAL] mongo_connector.oplog_manager:666 - Exception during collection dump
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 621, in do_dump
    upsert_all(dm)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 607, in upsert_all
    mapped_ns, long_ts)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 44, in wrapped
    reraise(new_type, exc_value, exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 33, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 367, in bulk_upsert
    for ok, resp in responses:
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 91, in _process_bulk_chunk
    raise e
ConnectionFailed: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=60))
2017-01-19 12:43:52,703 [ERROR] mongo_connector.oplog_manager:674 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=[u'localhost:27017'], document_class=dict, tz_aware=False, connect=True, replicaset=u'singleNodeRepl'), u'local'), u'oplog.rs')
2017-01-19 12:43:53,241 [ERROR] __main__:357 - MongoConnector: OplogThread <OplogThread(Thread-3, started 140353541756672)> unexpectedly stopped! Shutting down

I'm using mongo-connector 2.5.0, pymongo 3.4.0, MongoDB 3.2.10, and elastic2_doc_manager 0.3.0. With this setup I'm storing more than 100M documents.

I already raised the timeout to 60 seconds, as you can see in the log.

Previously, the following error had already appeared, which is why I had to start the resync:

2017-01-19 08:58:36,553 [ERROR] mongo_connector.doc_managers.elastic2_doc_manager:412 - Exception while commiting to Elasticsearch
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 406, in commit
    successes, errors = bulk(self.elastic, action_buffer)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 190, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 91, in _process_bulk_chunk
    raise e
ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'localhost', port=9200): Read timed out. (read timeout=10))

I don't know whether this is related to the newest error. Should I just set continueOnError? When that option is set, are documents ignored (i.e., not synced) whenever an error occurs?

ShaneHarvey commented 7 years ago

With continueOnError, documents that fail to sync during the collection dump period will be ignored. The general problem is that the Elasticsearch doc managers do not retry on connection/operation failure, see https://github.com/mongodb-labs/elastic2-doc-manager/issues/18.
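
Until that is fixed, one workaround, if you are driving the Elasticsearch bulk helpers yourself, is to replay the bulk request on connection failures. A minimal sketch, not mongo-connector code (bulk_with_retry, the retry count, and the backoff are illustrative; streaming_bulk and the exception class are standard elasticsearch-py):

import time

from elasticsearch.exceptions import ConnectionError  # ConnectionTimeout subclasses this
from elasticsearch.helpers import streaming_bulk

def bulk_with_retry(client, actions, retries=3, backoff=5.0):
    # Replay the whole bulk request on connection-level failures.
    # `actions` must be a list, not a generator, so it can be replayed.
    for attempt in range(1, retries + 1):
        try:
            return list(streaming_bulk(client, actions))
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff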

For now, I can only recommend increasing the Elasticsearch client timeout again. Do you see any errors or warnings in the Elasticsearch logs?
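
If the read timeout keeps biting, you could also try elasticsearch-py's built-in transport retries alongside a larger timeout, e.g. in the docManager args of the config above (retry_on_timeout and max_retries are standard elasticsearch-py client options, assuming clientOptions passes through to the client as described earlier; the values are only a starting point):

"args": {
  "clientOptions": {
    "timeout": 120,
    "retry_on_timeout": true,
    "max_retries": 3
  }
}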