zalando-nakadi / nakadi-producer-spring-boot-starter

Nakadi event producer as a Spring boot starter
MIT License
13 stars 8 forks source link

Add a QueryStatementBatcher #173

Closed ePaul closed 1 year ago

ePaul commented 1 year ago

This allows to fake batches for statements returning data. Not yet used in the rest of the code, that will be added in later commits.

(This came up as part of #170.)

JDBC distinguishes between "queries" (which can return results) and "updates" (which only can return a number of affected rows). Only for updates you can use the batching feature of JDBC (i.e. sending multiple statements together to the server to have them executed together, saving the round trips to the DB and back). This is then also reflected in Spring's JdbcTemplate (and NamedParameterJdbcTemplate).

As we want to return the auto-generated ID from the inserted rows, the "batching" feature won't work for us. Searching in the WWW, I found Batching Select Statements in JDBC, which suggests this approach:

This class implements this concept in a generic way, based on Spring's NamedParameterJdbcTemplate. The user provides a template for the SQL string (prefix, repeated part, suffix) with parameter name placeholders, and this class will take case of expanding the template into the several batch sizes of statements, and then using it to actually query/insert/return data in bulk.

(Mentioned in this Stack Overflow Answer.)

fbrns commented 1 year ago

πŸ‘

ePaul commented 1 year ago

(I've not yet merged this, because I'm thinking of doing a benchmark.)

ePaul commented 1 year ago

I did run a very simple benchmark (as a Unit test running against a local DB, for tiny data sets (inserting a row with two integers, returning an ID). Code is in QueryStatementBatcherIT (but disabled for normal builds, to not slow the build down).

The difference is likely less extreme for larger data, but more extreme when the database is further away (network-wise).

Method inserted item count time
QueryStatementBatcher (51, 13 ,4, 1) 5000 0.57s
QueryStatementBatcher (1100, 200, 51, 13, 4, 1) 5000 0.16s
separate queries, serially 5000 22.08s
separate queries, parallelStream 5000 3.2s
batchUpdate (without returning results) 5000 0.10s

The last two don't really work in our case – parallelStream is inserting the items in an arbitrary order, and batchUpdate (which is what we were using before) won't return the generated IDs.

Though this indicates that we generally only should use the batcher when we actually need the IDs, as there is some performance penalty for getting them.

Let me see whether there is an easy way to do this kind of batching where there is nothing to return.

ePaul commented 1 year ago

Some more investigation showed that JDBC actually (since JDBC 3.0, in Java 1.4) [provides a way for fetching the keys generated in update statements](https://docs.oracle.com/en/java/javase/17/docs/api/java.sql/java/sql/Statement.html#getGeneratedKeys()).

In Spring JDBC, this is mapped via the KeyHolder interface which can be passed to some of the update methods to JDBCTemplate and NamedParameterJDBCTemplate. Unfortunately, the batchUpdate methods don't support this.

I've found this Stack overflow answer showing a way how to do both, proving that it works (if your DB + JDBC driver supports it – it seems like postgresql does).

I've added a modified version of that answer to the benchmark (not committed) – it takes slightly less time than my QueryStatementBatcher with huge batches (0.15s instead of 0.16s), but still significantly more than the "do not return IDs" version doing a simple batch update.

There is an old issue (originally opened in Jira in 2006, closed there in 2012) for Spring JDBC suggesting to add this to JdbcTemplate + NamedParameterJdbcTemplate, with a relatively new pull request from 2022 actually implementing it (in a cleaner way than the mentioned SO answer). Given that this one is not yet merged, and thus won't arrive in spring JDBC soon, I'll not wait for it.

ePaul commented 1 year ago

Let me see whether there is an easy way to do this kind of batching where there is nothing to return.

I've also experimented with extending the QueryStatementBatcher to build similar "batched" statements for pure inserts (without returning results) – it is a bit faster than the batcher when returning results, but still slower than the plain batcher. The code is in a separate branch which I don't intend merging, just in case someone is interested.

ePaul commented 1 year ago

:+1:

fbrns commented 1 year ago

πŸ‘