yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0
1.88k stars, 478 forks

Use Elasticsearch Update API #146

Open llvtt opened 10 years ago

llvtt commented 10 years ago

This can replace the retrieve + apply strategy that we currently use in the Elastic DocManager. See the following:

We'll need something that can transform MongoDB "update specs" (e.g. {$set: {"x.y.z": 42}}) to Elasticsearch update scripts. I don't believe we can use partial documents (as opposed to scripts) to update docs stored in Elasticsearch, because there doesn't seem to be a way to nullify a field that way.
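As a rough illustration of what such a transformer might look like, here is a minimal sketch that handles only `$set` and `$unset`. The function name `update_spec_to_script` and the `p0`, `p1`, ... parameter names are hypothetical, not part of mongo-connector. It assumes Groovy-style scripts in which `ctx._source` exposes the document as nested maps, that script parameters are visible as plain variables, and that every path segment is a valid identifier and every intermediate object already exists (so array indices such as `a.0.b` are not handled).

```python
def update_spec_to_script(update_spec):
    """Translate a MongoDB update spec into (script, params).

    Hypothetical helper: only $set and $unset are supported. Any other
    top-level key (another operator, or a full replacement document)
    raises ValueError.
    """
    unsupported = set(update_spec) - {"$set", "$unset"}
    if unsupported:
        raise ValueError("unsupported update spec keys: %s" % sorted(unsupported))

    statements = []
    params = {}

    # $set: assign each value through a named parameter so the value is
    # never spliced into the script text itself.
    for i, (path, value) in enumerate(sorted(update_spec.get("$set", {}).items())):
        name = "p%d" % i
        params[name] = value
        statements.append("ctx._source.%s = %s" % (path, name))

    # $unset: remove the leaf key from its parent map.
    for path in sorted(update_spec.get("$unset", {})):
        parent, _, leaf = path.rpartition(".")
        target = "ctx._source" + ("." + parent if parent else "")
        statements.append('%s.remove("%s")' % (target, leaf))

    return "; ".join(statements), params


script, params = update_spec_to_script({"$set": {"x.y.z": 42}})
print(script)   # ctx._source.x.y.z = p0
print(params)   # {'p0': 42}
```

Passing values as parameters rather than formatting them into the script avoids quoting problems for strings and nested documents. A real implementation would also have to cover the remaining update operators and full-document replacements.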

xmasotto commented 10 years ago

As Luke noted, there are two ways to apply updates using Elasticsearch. One is to pass a partial document, which is merged into the stored document; this can add or update fields but not remove them. The second, more flexible option is to send a script snippet that is executed to produce the updated document. Unfortunately, dynamic scripting (sending the script as a string to the server) has some restrictions and appears to be disabled by default (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html). I'm not sure what they mean by "sandboxed language", so I'll have to look into that more.

To support the full update functionality, we would probably need users to enable dynamic scripting in their configuration, which might be undesirable as default behavior.
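To make the two styles concrete, the sketch below builds the request body for each. The helper names `partial_doc_body` and `script_body` are hypothetical, and the body shapes (`doc` for a partial document, top-level `script` and `params` for a script) are assumed to match the Update API of the Elasticsearch 1.x line under discussion; other versions may expect a different layout.

```python
import json


def partial_doc_body(fields):
    """Body for a partial-document update: fields are merged into the doc."""
    return {"doc": fields}


def script_body(script, params=None):
    """Body for a scripted update: the script mutates ctx._source."""
    body = {"script": script}
    if params:
        body["params"] = params
    return body


# Add or overwrite a field by merging a partial document.
print(json.dumps(partial_doc_body({"x": 42}), sort_keys=True))

# Remove a field, which requires a script (and dynamic scripting enabled).
print(json.dumps(script_body('ctx._source.remove("x")'), sort_keys=True))
```

Either dictionary would then be sent as the body of an update request for the target document, for example through the Python client's `update` call.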

vvaradhan commented 10 years ago

@xmasotto From http://www.elasticsearch.org/blog/elasticsearch-1-3-0-released/: Groovy and Lucene expressions are the default sandboxed languages. The sandbox restricts the use of certain classes and methods. For example, java.lang.System and calls to .getClass() on any object are blocked and can't be used.

vvaradhan commented 10 years ago

@xmasotto As for the default configuration, as of 1.3.0 dynamic scripting defaults to "sandbox".