yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0
1.88k stars 479 forks source link

mongo-connector not reconnecting to elasticsearch after the ReadTimeoutError occurs #402

Open kawaljeet opened 8 years ago

kawaljeet commented 8 years ago

When the connectivity between mongo-connector and elasticsearch is broken for a short while, mongo-connector should reconnect automatically, but it is not.

I manually stopped the elastisearch instance, to break the connectivity. I got the following messages in the mongo-connector.log file (find the log messages in the end)

After 5 minutes, I restarted the elasticsearch instance. The elastic search services are running fine and it is listening to port 9200.

But mongo-connetor didnt reconnected it (after elastic search again became available). I also tried by updating the collection, but it didnt not reconnected to elasticsearch.

Is there any configuration or the run time parameter, such that it can reconnect to the elasticsearch instance ?

I have also put the configuration file in the end Log message

2016-03-02 22:13:28,770 [WARNING] elasticsearch:82 - PUT http://10.9.220.233:9200/people/people/56d7657ea4dc6d2b7c055908?refresh=false [status:N/A request:10.005s] Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/elasticsearch-1.9.0-py2.7.egg/elasticsearch/connection/http_urllib3.py", line 78, in perform_request response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, _kw) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 608, in urlopen _stacktrace=sys.exc_info()[2]) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/util/retry.py", line 224, in increment raise six.reraise(type(error), error, _stacktrace) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 558, in urlopen body=body, headers=headers) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 380, in _make_request self._raise_timeout(err=e, url=url, timeout_value=read_timeout) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 308, in _raise_timeout raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value) ReadTimeoutError: HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10) 2016-03-02 22:13:28,771 [ERROR] mongo_connector.util:87 - Fatal Exception Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.3.dev0-py2.7.egg/mongo_connector/util.py", line 85, in wrapped func(_args, _kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.3.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 261, in run docman.upsert(doc, ns, timestamp) File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.3.dev0-py2.7.egg/mongo_connector/util.py", line 32, in wrapped return f(_args, *_kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.3.dev0-py2.7.egg/mongo_connector/doc_managers/elastic_doc_manager.py", line 143, in upsert refresh=(self.auto_commit_interval == 0)) File "/usr/local/lib/python2.7/dist-packages/elasticsearch-1.9.0-py2.7.egg/elasticsearch/client/utils.py", line 69, in _wrapped return func(_args, params=params, kwargs) File "/usr/local/lib/python2.7/dist-packages/elasticsearch-1.9.0-py2.7.egg/elasticsearch/client/init**.py", line 263, in index _make_path(index, doc_type, id), params=params, body=body) File "/usr/local/lib/python2.7/dist-packages/elasticsearch-1.9.0-py2.7.egg/elasticsearch/transport.py", line 307, in perform_request status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout) File "/usr/local/lib/python2.7/dist-packages/elasticsearch-1.9.0-py2.7.egg/elasticsearch/connection/http_urllib3.py", line 86, in perform_request raise ConnectionTimeout('TIMEOUT', str(e), e) ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10)) . find below the configuration file i am using

{ "comment": "Configuration options starting with '' are disabled", "__comment": "To enable them, remove the preceding ''", "mainAddress": "XX.XX.XX.XX:27017", "oplogFile": "oplog.timestamp", "noDump": false, "batchSize": -1, "verbosity": 1, "continueOnError": true, "logging": { "type": "file", "filename": "/root/mongoconnector.log", "format": "%(asctime)s [%(levelname)s] %(name)s:%(lineno)d - %(message)s", "rotationWhen": "D", "rotationInterval": 1, "rotationBackups": 10, "type": "syslog", "host": "localhost:514" }, "namespaces": { "include": [ "db.coll" ], "mapping": { "db.coll": "index.type" } }, "docManagers": [ { "docManager": "elastic_doc_manager", "targetURL": "XX.XX.XX.XX:9200", "bulkSize": 1000, "__uniqueKey": "_id", "autoCommitInterval": 120 } ] }

aherlihy commented 8 years ago

Hello! I just wanted to let you know I'm looking into this and should have an update for you soon. Thank you for your patience!

kawaljeet commented 8 years ago

Sure @aherlihy .. thanks a lot for that. If you need any other information, do let me know. Right now i am running cron job to monitor the stale connection and then restarting the services.

To add on more on this. After mongo connector and elastic servers are connected (both running on different servers) a. Stop the elastic search server. I can still see the tcp state @ port 9200 as ESTABLISHED. Though the elastisearch is down, b. Try updating the collection (elastic server still down). mongo connector will try to connect to elastic search server and now the tcp state changes to FIN_WAIT.. closing the connection, and it will throw errors. in the log files. c. Start the elastic search server. After c), my expectation was, the mongo connector should connect automatically. But it isnt.

kawaljeet commented 8 years ago

Hi @aherlihy ... wanted to know, if you were able to reproduce the issue ?

aherlihy commented 8 years ago

Hi @kawaljeet, I haven't found a solution to this issue yet. We are looking into how the elasticsearch driver deals with reconnecting and trying to find the best behavior to emulate.

To give you a timeline, because of the easter holidays there probably won't be any updates for the next week or so. Apologies for the delay!

kawaljeet commented 8 years ago

Hi @aherlihy .. please let me know, if you have any update on same. Thanks

pdspicer commented 8 years ago

Hi @aherlihy & @kawaljeet I'm no python expert by any means, but I have experienced the same issue using elastic2 doc manager and after looking through this for a while, it seems as though the likely source of the issue is that the elastic2 doc manager's exception_wrapper is not working for subclasses of the wrapped error types, thus the error is not being handled by the wrapper and is causing the OplogThread to exit.

Additionally, it appears as though the Connector thread does not detect the OplogThread's unclean exit, as the only check it does in it's run loop is to see if the thread's "running" parameter is set to false, which it does not seem it will be if an unhanded exception occurs.

The result is that unhandled exceptions go unnoticed, and the process neither exits nor continues processing the oplog for the failed thread (which either represents all of the oplog or the oplog of a single shard).

The fix seems as though it should be twofold:

  1. Add a self.shard_set[shard_id].is_alive() call alongside the self.shard_set[shard_id].running check in connector.py
  2. Open tickets in elastic-doc-manager & elastic2-doc-manager for handling otherwise unhandled exceptions from elasticsearch-py, as a simple TimeoutError should not be sufficient to cause the thread to exit with an unhandled exception.
llvtt commented 8 years ago

@kawaljeet , would you mind trying against the latest version from master? I just accepted a patch from @pdspicer that allows the connector to catch additional exception types like ReadTimeoutError.

kawaljeet commented 8 years ago

@llvtt .. Apologies for the late response. I have pulled up the new version (couple of hours back) and got the new mongo-connector. So, far things looks fine. But I need to monitor for some time (probably another couple of days). and will again share the feedback. Thanks a lot again.' Thanks a lot @pdspicer for your help too.

kawaljeet commented 8 years ago

Hi @llvtt / @pdspicer , mongo-connector is responding better in ReadTimeoutError case now, Thanks for the fix. Though sometimes it hangs, i am printing the log file entries, when it again stopped responding. ( I mean, even though there is update in the collection, there is no update in the elastic).

mongo-connector.log

2016-08-23 11:51:35,507 [WARNING] elasticsearch:82 - DELETE http://XX.XXX.X.XX:9200/myindex/mytype/57b6202ce4b0370214fc90ba?refresh=false [status:N/A request:10.011s] Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 94, in perform_request response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, _kw) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 608, in urlopen _stacktrace=sys.exc_info()[2]) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/util/retry.py", line 224, in increment raise six.reraise(type(error), error, _stacktrace) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 558, in urlopen body=body, headers=headers) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 380, in _make_request self._raise_timeout(err=e, url=url, timeout_value=read_timeout) File "/usr/local/lib/python2.7/dist-packages/urllib3-1.14-py2.7.egg/urllib3/connectionpool.py", line 308, in _raise_timeout raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value) ReadTimeoutError: HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10) 2016-08-23 11:51:35,508 [ERROR] mongo_connector.util:87 - Fatal Exception Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 85, in wrapped func(_args, _kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 244, in run entry['o']['_id'], ns, timestamp) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 32, in wrapped return f(_args, *_kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 254, in remove refresh=(self.auto_commit_interval == 0)) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped return func(_args, params=params, kwargs) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/init**.py", line 722, in delete doc_type, id), params=params) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 102, in perform_request raise ConnectionTimeout('TIMEOUT', str(e), e) ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10))

mongo-connector.err

Exception in thread Thread-718: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 755, in run self.function(_self.args, _self.kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 293, in run_auto_commit self.elastic.indices.refresh() File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped return func(_args, params=params, _kwargs) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/indices.py", line 54, in refresh '_refresh'), params=params) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 102, in perform_request raise ConnectionTimeout('TIMEOUT', str(e), e) ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10))

Exception in thread Thread-3: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 551, in *bootstrap_inner self.run() File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 85, in wrapped func(_args, _kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/oplog_manager.py", line 244, in run entry['o']['_id'], ns, timestamp) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/util.py", line 32, in wrapped return f(_args, _kwargs) File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 254, in remove refresh=(self.auto_commit_interval == 0)) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped return func(_args, params=params, _kwargs) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init**.py", line 722, in delete doc_type, id), params=params) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout) File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 102, in perform_request raise ConnectionTimeout('TIMEOUT', str(e), e) ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'10.9.220.233', port=9200): Read timed out. (read timeout=10))

ShaneHarvey commented 7 years ago

The elastic2-doc-manager issue that tracks retrying operations on connection failure: https://github.com/mongodb-labs/elastic2-doc-manager/issues/18