will / crystal-pg

a postgres driver for crystal

Support for query pipelining? #155

Open vlazar opened 6 years ago

vlazar commented 6 years ago

The recently released Round 17 of the TechEmpower Framework Benchmarks has an interesting change:

https://www.techempower.com/blog/2018/10/30/framework-benchmarks-round-17/

As you review Round 17 results, you'll notice that Postgres database tests are stratified—there are groups of test implementations that seem implausibly faster than other test implementations.

The underlying cause of this is use of a Postgres protocol feature we have characterized as "query pipelining" because it is conceptually similar to HTTP pipelining. We call it pipelining, but you could also call it multiplexing. It's a feature of the "Extended Query" protocol in Postgres. Query pipelining allows database clients to send multiple queries without needing to wait for each response before sending the next. It's similar to batching but provided invisibly in the driver.
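To put the win in perspective, here's a back-of-envelope sketch in Crystal (the numbers are made up; only the round-trip accounting matters):

# With a 1 ms network round trip, 20 sequential queries pay for 20 round
# trips, while a pipeline writes all 20 queries before reading any reply.
round_trip_ms = 1.0
queries = 20

sequential = queries * round_trip_ms # wait for each response in turn
pipelined  = round_trip_ms           # roughly one round trip for the batch

puts "sequential: #{sequential} ms, pipelined: ~#{pipelined} ms"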

will commented 6 years ago

This would be cool for sure. I wonder how to get pipelining and keep the crystal-db interface. It seems like keeping track of which result sets go where would be a bit tricky. I haven't thought about this much, though; there might be some nice, easy way.


rishavs commented 5 years ago

This would be fantastic. Right now the DB connection is the only thing in my crystal webapp which feels a bit slow.

eliasjpr commented 5 years ago

Would love to work with someone on this.

While looking for documentation, I found the following explanation and implementation:

eliasjpr commented 5 years ago

Also, has anyone considered binding to libpq instead? https://www.postgresql.org/docs/9.5/libpq.html

It seems to implement all of this already. We could bind to the library and create a crystal-db interface for it. Thoughts?
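For reference, a minimal sketch of what a direct libpq binding looks like in Crystal (just two or three well-known libpq functions; this is not crystal-pg code, and the linking flags vary by platform):

@[Link("pq")]
lib LibPQ
  type PGconn = Void*
  fun connectdb = PQconnectdb(conninfo : UInt8*) : PGconn
  fun status = PQstatus(conn : PGconn) : Int32
  fun finish = PQfinish(conn : PGconn)
end

conn = LibPQ.connectdb("postgres://localhost/postgres")
puts "connected" if LibPQ.status(conn) == 0 # 0 is CONNECTION_OK
LibPQ.finish(conn)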

will commented 5 years ago

This started out as a binding to libpq but moved away from it, both to avoid some nasty linking problems on some platforms and to take easier advantage of Crystal's async I/O.


eliasjpr commented 5 years ago

I think the community can benefit from a native Crystal implementation too.

jgaskins commented 2 years ago

Just posted #253 to support query pipelining, if y'all could please have a look. I've been evaluating TimescaleDB for a side project and being able to pipeline INSERT queries for datapoints would be fantastic.
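For the datapoint use case, I picture something like this (a sketch only: the schema and the datapoints collection are made up, and the method names follow the block API proposed in #253):

record Datapoint, time : Time, value : Float64

datapoints = [
  Datapoint.new(Time.utc, 1.0),
  Datapoint.new(Time.utc, 2.0),
]

# pg is assumed to be an open connection
pg.pipeline do |pipe|
  datapoints.each do |point|
    pipe.query "INSERT INTO datapoints (time, value) VALUES ($1, $2)",
      point.time, point.value
  end
end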

straight-shoota commented 2 years ago

@jgaskins Looks nice. Regarding your questions about the API design, perhaps it makes sense to focus on a very low level API at first. This enables using the feature, but you need to know what you're doing. Later there could be a more high level API on top of that.

For a high-level API, I think it could be useful to combine the query and its result interpretation into a promise object. I don't think this even needs to be wrapped in a pipeline block; it could even be interleaved with direct queries on the same connection:

query1 = connection.pipeline.query_one "SELECT 42", as: Int32
query2 = connection.pipeline.query_all "SELECT * FROM posts LIMIT $1", 10, as: Post

answer = query1.result
posts = query2.result

Behind the scenes, each pipelined query creates a promise containing the query statement and its result information. As soon as the first result method is called, it triggers a pipeline commit: this sends all queries waiting in the pipeline and then parses the responses, placing the values into the promise objects. The second result call returns directly because the value has already been retrieved.
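A toy model of those mechanics (all names are hypothetical and the wire protocol is stubbed out; it only shows the commit-on-first-result behavior):

class Pipeline
  @queries = [] of String
  @results = [] of String
  @committed = false

  def enqueue(sql : String) : Int32
    @queries << sql
    @queries.size - 1
  end

  def result_for(index : Int32) : String
    commit unless @committed
    @results[index]
  end

  # stands in for sending all buffered queries in one round trip
  # and parsing every response in order
  private def commit
    @results = @queries.map { |sql| "row for: #{sql}" }
    @committed = true
  end
end

class Promise
  def initialize(@pipeline : Pipeline, @index : Int32)
  end

  def result : String
    @pipeline.result_for(@index)
  end
end

pipeline = Pipeline.new
query1 = Promise.new(pipeline, pipeline.enqueue("SELECT 42"))
query2 = Promise.new(pipeline, pipeline.enqueue("SELECT * FROM posts"))

answer = query1.result # first call commits the whole pipeline
posts  = query2.result # returns directly; already fetched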

jgaskins commented 2 years ago

@straight-shoota That was the first thing I tried: I wanted the pipeline to use the DB::QueryMethods API, but for multiple reasons it doesn't work with Crystal's type system.

straight-shoota commented 2 years ago

How so? I would think this should be perfectly representable in Crystal's type system.

jgaskins commented 2 years ago

I thought it would be, too, until I tried it and realized that returning objects of a specific type is easy but storing them in memory for processing later is not.

Feel free to try it out if you're curious, but it's a moot point anyway. That's not how query pipelines are used in practice, and interleaving pipelined and immediate queries as you suggested would increase cognitive load and/or be prone to bugs. For example, consider this:

posts = pg.pipeline.query_one "...", as: Post
comments = pg.pipeline.query_all "...", as: Comment
author = pg.query_one "...", as: User
reactions = pg.pipeline.query_all "...", as: Reaction

posts = posts.result
comments = comments.result
reactions = reactions.result

If you send the first two queries before the third, this code breaks. If you don't, it'd be very confusing when debugging why the queries appear out of order. The pipeline block in #253 communicates a clear start and end point of the pipelined queries to anyone reading the code.

Even if the queries are not being interleaved like that, the structure provided by the block is still better than without it:

posts = pg.pipeline.query "..."
comments = pg.pipeline.query "..."
author = pg.pipeline.query "..."
reactions = pg.pipeline.query "..."

posts = posts.read_one(Post)
comments = comments.read_all(Comment)
author = author.read_one(User)
reactions = reactions.read_all(Reaction)

# vs

rs = pg.pipeline do |pipe|
  pipe.query "..."
  pipe.query "..."
  pipe.query "..."
  pipe.query "..."
end

posts = rs.read_one(Post)
comments = rs.read_all(Comment)
author = rs.read_one(User)
reactions = rs.read_all(Reaction)

The only difference between these two is promise vs block, and the second is still significantly easier to grok.

straight-shoota commented 2 years ago

My concern about a structure where the query and the reading of its result are completely separate is that you have to make sure to use the same order in both. But this probably doesn't matter too much. Or at least, it's already great to have a functioning lower-level API. For such an advanced feature, that might actually be good enough. You'll just have to make sure to use it properly.

jgaskins commented 2 years ago

My concern about a structure where query and reading the result are completely separate is that you have to make sure to use the same order in both.

I hear you on this, but a query pipeline is effectively a channel, so this makes sense. That you're thinking about the problem in a different way is not a bad thing, but while a bespoke async I/O implementation with promises can be implemented on top of a pipeline, it is not itself a pipeline.
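The analogy in miniature (plain Crystal, no database involved): results come back in the order the queries went in, so reads have to follow send order.

channel = Channel(String).new(2)

spawn do
  channel.send "result of query 1" # responses arrive in FIFO order
  channel.send "result of query 2"
end

first  = channel.receive # always the first query's result
second = channel.receive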

For such an advanced feature this might actually be good enough

Agreed that it's pretty advanced and will likely rarely be used. I imagine most folks are using ORMs for most DB interactions and will only dip into an API like this when they really need the performance.