will / crystal-pg

a postgres driver for crystal
BSD 3-Clause "New" or "Revised" License

Connection pool unreliable? #196

Open m-o-e opened 4 years ago

m-o-e commented 4 years ago

Currently DB::Database does not recover from temporary connection failures. It does not seem to evict broken connections from the pool but continues to attempt to use them.

This means just a single brief outage will leave the pool in a permanently broken state.

The only way to recover from connection failures appears to be to discard the entire DB::Database instance and create a new pool.
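For clarity, the recovery I mean looks roughly like this (just a sketch, using the same db and url as in the test case below):

# The only recovery that works for me: throw the broken pool away
# and build a fresh one from scratch.
db.close rescue nil
db = DB.open(url)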

Is this a bug or am I using it wrong? 🤔


Test-case:

require "db"
require "pg"

initial_pool_size=5
max_pool_size=5
retry_attempts=99999

url = "postgres://localhost/test?initial_pool_size=#{initial_pool_size}&max_pool_size=#{max_pool_size}&max_idle_pool_size=#{max_pool_size}&checkout_timeout=5.0&retry_attempts=#{retry_attempts}&retry_delay=1"

puts "Trying: #{url}"
db = DB.open(url)
loop do
  begin
    db.query("select 42").close
    puts "."
  rescue e
    puts e
  end
  sleep 1
end

Leave the above running and, while it runs, restart your local Postgres server.

Regardless of what settings you choose, you will see the following output:

.
.
.
Error reading socket: Connection reset by peer
Error writing to socket: Broken pipe
Error writing to socket: Broken pipe
Error writing to socket: Broken pipe
[...]
bcardiff commented 4 years ago

As long as you are not checking out connections explicitly, the pool should be able to recover from temporary outages. There is a sample (using MySQL) at https://crystal-lang.org/reference/database/connection_pool.html

Maybe an exception with the wrong type is raised and the connection loss is not detected.

But your snippet looks good, and I would expect it to be resilient.
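Roughly the pattern I mean, as a sketch (not the exact doc sample): issue queries through the DB::Database itself, without an explicit checkout, so the pool can detect a lost connection and retry.

require "db"
require "pg"

# retry_attempts/retry_delay control how the pool reacts when a pooled
# statement raises a connection loss.
url = "postgres://localhost/test?retry_attempts=5&retry_delay=1"

DB.open(url) do |db|
  # Queries go through the DB::Database, never through db.checkout,
  # so a lost connection should be retried transparently by the pool.
  puts db.scalar("SELECT 42")
end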

m-o-e commented 4 years ago

Possibly it's postgres specific?

I couldn't find a way to make my above snippet resilient - it always breaks when the server is restarted.

straight-shoota commented 4 years ago

Yeah, I've encountered similar errors with postgres as well, but never did any deeper investigation. So maybe it's an issue in crystal-pg.

stakach commented 3 years ago

Also finding this an issue working with GCP's hosted Postgres-compatible database. New connections take upwards of 2 seconds to establish, so we really need to rely on the pool. We have the pool configured with max_idle_pool_size == max_pool_size == 50, but if nothing is going on for a while the connections eventually time out and broken sockets are returned, causing DB::ConnectionLost errors and 500s in our web app.

Also seeing the pool not recovering from DB outages. We work around that with health checks, but it's not optimal.
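Roughly what our health-check workaround looks like, as a simplified sketch (the HealthCheckedDB wrapper is made up for illustration):

require "db"
require "pg"

DB_URL = "postgres://localhost/test?max_idle_pool_size=50&max_pool_size=50"

class HealthCheckedDB
  getter db : DB::Database

  def initialize(@url : String)
    @db = DB.open(@url)
  end

  # Probe the pool; if it is broken, discard the whole pool and reopen it.
  def check!
    @db.scalar("SELECT 1")
  rescue ex : DB::Error | IO::Error
    @db.close rescue nil
    @db = DB.open(@url)
  end
end

handle = HealthCheckedDB.new(DB_URL)

# Run the probe periodically in a background fiber.
spawn do
  loop do
    handle.check!
    sleep 30.seconds
  end
end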

akadusei commented 7 months ago

As long as you are not checking out connections explicitly the pool should be able to recover from temporary outages.

From a cursory look at the code, the connection pool does not automatically retry a lost connection. To ensure a lost connection is retried, #retry(&block) must be called explicitly:

db.retry do
  db.query #...
end

Otherwise, all the retry_* options do nothing at all, it seems.

Perhaps PG::Statement should inherit from DB::PoolStatement instead of DB::Statement?

bcardiff commented 7 months ago

The db.retry is not needed. If you do db.query (over the Database) the retry should happen automatically. See https://crystal-lang.org/reference/1.10/database/connection_pool.html#sample

I haven't validated if something is off with pg in this regard though.
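To illustrate the distinction, a sketch (assuming the behaviour described in the docs; not verified against pg here):

require "db"
require "pg"

db = DB.open("postgres://localhost/test?retry_attempts=3&retry_delay=1")

# Pool-managed: a lost connection should be caught internally and retried
# up to retry_attempts times before the error reaches the caller.
puts db.scalar("SELECT 42")

# Explicit checkout: the caller owns this connection for the duration of the
# block, so a dead socket is the caller's problem; retry_* does not apply.
db.using_connection do |conn|
  puts conn.scalar("SELECT 42")
end

db.close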

akadusei commented 7 months ago

Oh, I must have missed that, then.

I get DB::ConnectionLost regularly with multiple apps in staging. This happens when those apps haven't seen any interaction in a while. I use a serverless database provider that suspends the server after it has been idle for some time (it starts up again on the first connection attempt).

Here's a trace, if that helps:

 (DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/samba/src/presets/client.cr:7:3 in 'run_operation'
  from /tmp/lucky/src/actions/oauth/token/create.cr:4:3 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/log_handler.cr:29:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

I suspect the app attempts the first connection, fails, and raises the error without retrying. I have tried various connection params to no avail. Currently it's initial_pool_size=20&max_idle_pool_size=20&max_pool_size=20&retry_attempts=3&retry_delay=1. Restarting the apps seems to fix it, until it happens again.

They are Lucky apps, and it seems Avram does an explicit checkout here:

https://github.com/luckyframework/avram/blob/0b9797b918373615bafea54986849bb886ec239b/src/avram/database.cr#L188-L201

I don't see a retry there. Could that be it, or am I missing something?
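If that is the problem, something along these lines might work around it (purely a sketch, not Avram's actual code; it assumes DB::Database#retry re-runs the block on a lost connection, as discussed above):

require "db"
require "pg"

db = DB.open("postgres://localhost/test")

# Hypothetical workaround: wrap the explicit checkout in db.retry so a
# DB::ConnectionLost raised on a stale pooled connection is retried with a
# fresh connection instead of bubbling up as a 500.
db.retry do
  db.using_connection do |conn|
    conn.scalar("SELECT 1")
  end
end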