yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Sync Issues Elasticsearch #678

Open vaibhavbhalla07 opened 7 years ago

vaibhavbhalla07 commented 7 years ago

Hi, I am having issues synchronizing a MongoDB collection with Elasticsearch.

This is the only exception I have observed in the logs:

[ERROR] mongo_connector.doc_managers.elastic2_doc_manager:491 - Bulk request failed with exception
Traceback (most recent call last):
  File "/opt/deployment/elastic5/python3/local/lib/python3.4/site-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 484, in send_buffered_operations
    successes, errors = bulk(self.elastic, action_buffer)
  File "/opt/deployment/elastic5/python3/local/lib/python3.4/site-packages/elasticsearch/helpers/__init__.py", line 194, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/opt/deployment/elastic5/python3/local/lib/python3.4/site-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/opt/deployment/elastic5/python3/local/lib/python3.4/site-packages/elasticsearch/helpers/__init__.py", line 134, in _process_bulk_chunk
    raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
elasticsearch.helpers.BulkIndexError: ('250 document(s) failed to index.', [
  {'delete': {'result': 'not_found', '_version': 3, '_type': 'mongodb_meta', '_index': 'mongodb_meta', '_id': '58bb6119e4b07ed3aa1abfd8', 'found': False, '_shards': {'successful': 2, 'failed': 0, 'total': 2}, 'status': 404}},
  {'delete': {'result': 'not_found', '_version': 3, '_type': 'mongodb_meta', '_index': 'mongodb_meta', '_id': '58bb6119e4b07ed3aa1abfd9', 'found': False, '_shards': {'successful': 2, 'failed': 0, 'total': 2}, 'status': 404}},
  {'delete': {'result': 'not_found', '_version': 3, '_type': 'mongodb_meta', '_index': 'mongodb_meta', '_id': '58bb6119e4b07ed3aa1abfda', 'found': False, '_shards': {'successful': 2, 'failed': 0, 'total': 2}, 'status': 404}},
  ... so on 250 documents..

We regularly delete large numbers of documents from the Mongo collection, and I suspect this is related to the issue.

Please guide me on this.

Thanks

gandola commented 7 years ago

I'm getting the same errors when deleting a large number of documents.

sliwinski-milosz commented 7 years ago

Are all of these errors for: '_type': 'mongodb_meta', '_index': 'mongodb_meta'?
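One way to check this is to tally the failed items by operation, index, type and status. The sketch below is only an illustration: `summarize_bulk_errors` is a hypothetical helper, and the sample list copies the shape of the items from the log above (trimmed to the relevant keys). In a real run the list would come from the second argument of the `BulkIndexError`, i.e. `exc.args[1]`, as the traceback shows.

```python
from collections import Counter


def summarize_bulk_errors(errors):
    """Count failed bulk items by (operation, _index, _type, status)."""
    counts = Counter()
    for item in errors:
        # Each failed item is a one-key dict such as {'delete': {...}}.
        for op, info in item.items():
            key = (op, info.get("_index"), info.get("_type"), info.get("status"))
            counts[key] += 1
    return counts


# Sample items shaped like the ones in the log above (keys trimmed).
sample_errors = [
    {"delete": {"result": "not_found", "_type": "mongodb_meta",
                "_index": "mongodb_meta", "_id": "58bb6119e4b07ed3aa1abfd8",
                "found": False, "status": 404}},
    {"delete": {"result": "not_found", "_type": "mongodb_meta",
                "_index": "mongodb_meta", "_id": "58bb6119e4b07ed3aa1abfd9",
                "found": False, "status": 404}},
]

summary = summarize_bulk_errors(sample_errors)
for key, count in summary.items():
    print(key, count)
```

If every key in the summary is a `delete` on `mongodb_meta` with status 404, the failures are limited to the meta documents.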

Whenever we index docs to elasticsearch, we index meta_index as well. What is the mongodb_meta index in Elasticsearch?

Logic snippet

As you can see in the snippet, we always use the same _type for meta_action, while action uses a different _type per collection. I suspect there are cases where you have the same document _id under different doc_type values. In that case two documents get indexed in Elasticsearch, but only one corresponding meta document exists, because both writes to the meta index use the same _type and _id. When you later delete these two documents (even in separate bulk requests), the first delete removes the meta document and the second one fails because the meta document has already been deleted.
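To make the suspected sequence concrete, here is a minimal sketch using a toy in-memory store instead of a real Elasticsearch client. `FakeStore`, `upsert` and `remove` are hypothetical names for illustration and are not the connector's code; the only behaviour assumed is that every document write is paired with a meta write that always uses the same index and _type.

```python
META = "mongodb_meta"  # meta index and meta _type, as seen in the error log


class FakeStore:
    """Toy stand-in for Elasticsearch: documents keyed by (index, type, id)."""

    def __init__(self):
        self.docs = {}

    def index(self, index, doc_type, doc_id, body):
        self.docs[(index, doc_type, doc_id)] = body

    def delete(self, index, doc_type, doc_id):
        key = (index, doc_type, doc_id)
        if key in self.docs:
            del self.docs[key]
            return 200
        return 404  # not_found, like the failed items in the log


def upsert(store, index, doc_type, doc_id, body):
    store.index(index, doc_type, doc_id, body)
    # The meta document always uses the same _type, so two collections
    # sharing an _id overwrite each other's meta document.
    store.index(META, META, doc_id, {"ns": index + "." + doc_type})


def remove(store, index, doc_type, doc_id):
    # Returns (status of document delete, status of meta delete).
    return (store.delete(index, doc_type, doc_id),
            store.delete(META, META, doc_id))


store = FakeStore()
shared_id = "58bb6119e4b07ed3aa1abfd8"
upsert(store, "mydb", "coll_a", shared_id, {"x": 1})
upsert(store, "mydb", "coll_b", shared_id, {"x": 2})

first = remove(store, "mydb", "coll_a", shared_id)
second = remove(store, "mydb", "coll_b", shared_id)
print(first, second)
```

In this toy model the second removal still deletes its own document, but its meta delete comes back as 404, which matches the pattern in the reported errors.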

Maybe we should use doc_type instead of self.meta_type for the meta document, but I don't know the logic behind the meta index, so let's wait until someone who knows it takes a look.
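As a rough sketch of that idea, again with a toy in-memory set rather than the real doc manager (`add_meta` and `delete_meta` are hypothetical helpers): if the meta document is keyed by the source doc_type, two collections that share an _id each get their own meta entry, and both deletes succeed. Whether this fits the rest of the connector's use of the meta index is not verified here.

```python
META_INDEX = "mongodb_meta"

# Toy meta store: a set of (index, type, id) keys.
meta_store = set()


def add_meta(doc_type, doc_id):
    # Key the meta entry by the source doc_type instead of one fixed meta type.
    meta_store.add((META_INDEX, doc_type, doc_id))


def delete_meta(doc_type, doc_id):
    key = (META_INDEX, doc_type, doc_id)
    if key in meta_store:
        meta_store.remove(key)
        return 200
    return 404


shared_id = "58bb6119e4b07ed3aa1abfd8"
add_meta("coll_a", shared_id)
add_meta("coll_b", shared_id)

statuses = [delete_meta("coll_a", shared_id), delete_meta("coll_b", shared_id)]
print(statuses)
```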

gandola commented 7 years ago

Hi @sliwinski-milosz,

That's exactly the reason. I found that we were inserting the same _id into two different collections and eventually deleting the data from both, which is why we are getting this problem.

Even though this is a bug on the app side, your idea of using the doc_type makes sense.

Thanks for the insight!