yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Spot check: about 20% of data updates do not sync from MongoDB to Elasticsearch #604

Open lifuchao opened 7 years ago

lifuchao commented 7 years ago

elasticsearch: v2.3.4, mongodb: v3.2.9, mongo-connector: v2.4.1

My question is as follows. I use mongo-connector to sync data from MongoDB to Elasticsearch. The collection has about 80,000,000 docs. The initial sync is OK, but the data in MongoDB gets updated afterwards, and I found that some updates are not synced to Elasticsearch. I spot-checked 1000 docs, and about 200 of them had not synced. For example, the value of field "key1" was "aaaa" before the update and "bbbb" after it, but the data in Elasticsearch is still "aaaa", and there is no error in mongo-connector.log. How does mongo-connector sync updates? How does it ensure that the data stays up to date?
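The spot check described above can be sketched as a small comparison routine. This is only an illustration: `find_stale_ids` is a hypothetical helper, and it assumes the MongoDB docs and the Elasticsearch `_source` bodies have already been fetched into plain Python dicts (for example with pymongo and the Elasticsearch client, which are not shown here).

```python
def find_stale_ids(mongo_docs, es_sources_by_id, fields):
    """Return the ids whose compared fields differ between MongoDB and Elasticsearch.

    mongo_docs:        iterable of dicts as read from MongoDB (must contain "_id")
    es_sources_by_id:  dict mapping str(_id) -> the Elasticsearch _source dict
    fields:            names of the fields to compare
    """
    stale = []
    for doc in mongo_docs:
        doc_id = str(doc["_id"])
        es_source = es_sources_by_id.get(doc_id)
        # A doc is stale if it is missing in Elasticsearch or any field differs.
        if es_source is None or any(doc.get(f) != es_source.get(f) for f in fields):
            stale.append(doc_id)
    return stale


# Made-up sample data mirroring the "key1" example above.
mongo_sample = [{"_id": 1, "key1": "bbbb"}, {"_id": 2, "key1": "cccc"}]
es_sample = {"1": {"key1": "aaaa"}, "2": {"key1": "cccc"}}

stale_ids = find_stale_ids(mongo_sample, es_sample, ["key1"])
print(stale_ids)
```

Recording the stale ids, rather than only counting them, makes it easier to look up the matching updates later.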

thanks in advance!

mumlax commented 7 years ago

What about the Elasticsearch doc manager? Did you manually update it to use the bulk API? The number 1000 sounds suspicious. If so, my old "problem" here could be the reason: you have to set autoCommitInterval.
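As a rough sketch of where that setting lives: the snippet below builds a mongo-connector style config with `autoCommitInterval` set on the doc manager entry. The addresses and the value 5 are placeholders, not values from this issue, and the exact option semantics should be checked against the documentation for the installed doc manager version.

```python
import json

# Placeholder config; hosts, ports and the interval are assumptions.
config = {
    "mainAddress": "localhost:27017",
    "docManagers": [
        {
            "docManager": "elastic2_doc_manager",
            "targetURL": "localhost:9200",
            # Seconds between commits to Elasticsearch (assumed value).
            "autoCommitInterval": 5,
        }
    ],
}

config_text = json.dumps(config, indent=2)
print(config_text)
```

The printed JSON can be saved as, for example, config.json and passed to mongo-connector with the -c option.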

ShaneHarvey commented 7 years ago

@lifuchao we just released mongo-connector 2.5.0 and elastic2-doc-manager 0.3.0. Would you be able to upgrade to the latest version and check back if the issue is still present or not?

To upgrade mongo-connector and the elastic2-doc-manager: pip install --upgrade 'mongo-connector[elastic2]'

lifuchao commented 7 years ago

@ShaneHarvey I tried to upgrade as you said, but it failed. The error info in the pip.log file looks like this:

Ignoring link https://pypi.python.org/packages/ed/5f/c5b60c72c08773d60b83d8255a4e1b73d3ff9eeece780e5f22be7dbc1c67/pymongo-0.14.tar.gz#md5=96c7b066815445e75ad095c0fa760eab (from https://pypi.python.org/simple/pymongo/), version 0.14 doesn't match >=2.9
Ignoring link https://pypi.python.org/packages/ef/2e/d05c3d2e244d26f65a71bec20b6080c54cfbd97eaa9d6c358dcfbea62425/pymongo-0.5.3pre.tar.gz#md5=4c09638b71b3590f82b9f8529689bdb8 (from https://pypi.python.org/simple/pymongo/), version 0.5.3pre doesn't match >=2.9
Ignoring link https://pypi.python.org/packages/fe/6c/5cf65618ee2248e264c1825395b16b1a0f3e96349d340db4af04c386ea8c/pymongo-2.3.tar.gz#md5=0d342ad1506f983af671d0b0e0e1efec (from https://pypi.python.org/simple/pymongo/), version 2.3 doesn't match >=2.9
Using version 3.4.0 (newest of versions: 3.4.0, 3.3.1, 3.3.0, 3.3.0, 3.2.2, 3.2.1, 3.2, 3.1.1, 3.1, 3.0.3, 3.0.2, 3.0.1, 3.0, 2.9.4, 2.9.3, 2.9.2, 2.9.1, 2.9)
Downloading/unpacking pymongo>=2.9 from https://pypi.python.org/packages/82/26/f45f95841de5164c48e2e03aff7f0702e22cef2336238d212d8f93e91ea8/pymongo-3.4.0.tar.gz#md5=aa77f88e51e281c9f328cea701bb6f3e (from mongo-connector[elastic2])
Downloading from URL https://pypi.python.org/packages/82/26/f45f95841de5164c48e2e03aff7f0702e22cef2336238d212d8f93e91ea8/pymongo-3.4.0.tar.gz#md5=aa77f88e51e281c9f328cea701bb6f3e
Cleaning up...
Removing temporary dir /tmp/pip_build_root...
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1198, in prepare_files
    do_download,
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1376, in unpack_url
    self.session,
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 572, in unpack_http_url
    download_hash = _download_url(resp, link, temp_location)
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 433, in _download_url
    for chunk in resp_read(4096):
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 421, in resp_read
    chunk_size, decode_content=False):
  File "/usr/share/python-wheels/urllib3-1.7.1-py2.py3-none-any.whl/urllib3/response.py", line 225, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/share/python-wheels/urllib3-1.7.1-py2.py3-none-any.whl/urllib3/response.py", line 174, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 573, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out

lifuchao commented 7 years ago

@ShaneHarvey I upgraded successfully with this command: sudo pip install --upgrade 'mongo-connector[elastic2]' --default-timeout=600

I don't know yet whether this solves the update sync problem. I will spot-check over the next few days and report back.
Thanks a lot.

lifuchao commented 7 years ago

@ShaneHarvey After I upgraded mongo-connector to v2.5.0, I spot-checked the data in MongoDB and Elasticsearch again, and some updates still do not sync (about 10%). I want to know how mongo-connector syncs updates. Are there any references? My ES index has about 0.1 billion docs, and data is updated every day. How can I make sure that the updates sync successfully? I hope you can help. Thanks.
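For context on the question above: mongo-connector follows the MongoDB oplog, where an update appears as an entry with "op": "u", the target _id under "o2", and the change under "o". The sketch below is a simplified illustration of applying such an entry to an already indexed document, not mongo-connector's actual code. `apply_update_spec` is a hypothetical helper that handles only top-level `$set`/`$unset` and whole-document replacement, and the namespace and values are made up.

```python
def apply_update_spec(doc, update_spec):
    """Return a copy of doc with a MongoDB-style update applied (simplified)."""
    if not any(key.startswith("$") for key in update_spec):
        # Replacement-style update: the new body replaces the old one, _id is kept.
        new_doc = dict(update_spec)
        new_doc["_id"] = doc["_id"]
        return new_doc
    new_doc = dict(doc)
    for field, value in update_spec.get("$set", {}).items():
        new_doc[field] = value      # top-level fields only in this sketch
    for field in update_spec.get("$unset", {}):
        new_doc.pop(field, None)
    return new_doc


# Made-up oplog entry for the "key1" example from the first comment.
oplog_entry = {
    "op": "u",
    "ns": "mydb.mycoll",
    "o2": {"_id": 1},
    "o": {"$set": {"key1": "bbbb"}},
}
indexed_doc = {"_id": 1, "key1": "aaaa"}

updated_doc = apply_update_spec(indexed_doc, oplog_entry["o"])
print(updated_doc)
```

If the index still shows the old value after such an entry, one thing worth checking is whether the oplog contains the update for that _id at all.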

ShaneHarvey commented 7 years ago

Can you post the steps to reproduce this issue (sample MongoDB data, sample updates that trigger the missing updates, and mongo-connector config file)? Otherwise, there's no way for me to find out if/where this is a bug in mongo-connector.