yougov / mongo-connector

MongoDB data stream pipeline tools by YouGov (adopted from MongoDB)
Apache License 2.0

Consider putting back namespaces in oplog query #684

Open makhdumi opened 7 years ago

makhdumi commented 7 years ago

The oplog query used to have a 'ns': { $in: [...] } filter. This was removed at some point, and filtering on namespace is now done in code (Python).

But this can be quite slow, e.g. with databases containing many collections, or if other collections "pollute" the oplog far more than the collections the connector is interested in. It's much faster, at least on our cluster, to let the MongoDB server do the filtering. Without the cursor filter, the connector lags by 1,000 to 12,000 seconds for me, even with fairly low update throughput to MongoDB: about 30 updates per minute.

I understand that namespaces are now configurable with wildcards/regex, but if no wildcard/regex is specified in the config, then could the connector go back to doing the filtering on the cursor?
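To make the request concrete, here is a minimal sketch of how the query filter could be chosen. It is not mongo-connector's actual code: `build_oplog_filter` and the "contains `*`" wildcard check are hypothetical, and the timestamp is passed through as-is. A real filter would probably also need to let command entries through (they are recorded under `<db>.$cmd`), which this sketch ignores.

```python
def build_oplog_filter(last_ts, namespaces):
    """Build an oplog query filter (hypothetical helper, for illustration only).

    If every configured namespace is a plain "db.collection" string, the
    namespace filter is pushed down to the server with $in. If any namespace
    contains a wildcard, only the timestamp filter is used and the namespace
    filtering is left to the Python side.
    """
    query = {"ts": {"$gt": last_ts}}
    has_wildcard = any("*" in ns for ns in namespaces)
    if namespaces and not has_wildcard:
        query["ns"] = {"$in": list(namespaces)}
    return query


# Plain namespaces: the server does the filtering.
plain = build_oplog_filter(5, ["db.a", "db.b"])

# A wildcard namespace: fall back to filtering in Python.
wild = build_oplog_filter(5, ["db.*"])
```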

ShaneHarvey commented 7 years ago

The 'ns': { $in: [...] } filter was not removed to support wildcards in namespaces; we could always use a $regex query for that. It was removed because, if we filter out ignored collections in the query, the oplog might become completely filled with operations that don't match our query. mongo-connector would then abort with an error because its last seen oplog entry is no longer in the oplog.

It might be possible, though, to add back the 'ns': { $in: [...] } filter if we periodically update the checkpoint to the latest ignored entry. Hmmmm, I'll have to give this a little more thought, but I'd like to add it back if it improves performance significantly.
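A rough sketch of the checkpoint idea, simulated on an in-memory list rather than a real oplog. Everything here is an assumption for illustration: `poll_once` is a hypothetical name, the list comprehension stands in for the server-side filtered query, and integer timestamps stand in for BSON timestamps. The newest timestamp is read first and used as an upper bound, so the checkpoint never moves past an entry that the filtered query has not had a chance to return.

```python
def poll_once(oplog, checkpoint, namespaces):
    """Return (matching entries, new checkpoint) for one polling pass.

    `oplog` is a list of {"ts": int, "ns": str} dicts in ascending ts order.
    """
    # Stand-in for reading the most recent oplog entry, matching or not.
    newest_ts = oplog[-1]["ts"] if oplog else checkpoint

    # Stand-in for the server-side query:
    # {"ts": {"$gt": checkpoint, "$lte": newest_ts}, "ns": {"$in": [...]}}
    matched = [
        entry
        for entry in oplog
        if checkpoint < entry["ts"] <= newest_ts and entry["ns"] in namespaces
    ]

    # Advance the checkpoint past ignored entries too, so it keeps pointing
    # at a recent oplog position even when nothing matches.
    return matched, max(checkpoint, newest_ts)


oplog = [
    {"ts": 1, "ns": "db.a"},
    {"ts": 2, "ns": "db.noise"},
    {"ts": 3, "ns": "db.noise"},
]
matched, checkpoint = poll_once(oplog, 0, {"db.a"})
```

After this pass the checkpoint sits at the last ignored entry rather than at the last matching one, which is the property that would keep the connector from losing its place when ignored collections dominate the oplog.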