**m-o-e** opened this issue 4 years ago
As long as you are not checking out connections explicitly, the pool should be able to recover from temporary outages. There is a sample (using MySQL) at https://crystal-lang.org/reference/database/connection_pool.html
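A minimal sketch of that implicit-use pattern (the URL and retry settings here are assumptions, not from the thread):

```crystal
require "db"
require "pg"

# retry_attempts/retry_delay let the pool replace a dead connection
# and re-run the statement transparently.
db = DB.open("postgres://localhost/app?retry_attempts=3&retry_delay=0.5")

# No explicit checkout: each call borrows a pooled connection,
# so a lost connection can be retried behind the scenes.
db.query_one "SELECT 1", as: Int32
```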
Maybe an exception with the wrong type is raised and the connection loss is not detected.
But your snippet looks good and I would expect it to be resilient.
Possibly it's Postgres-specific?
I couldn't find a way to make my above snippet resilient - it always breaks when the server is restarted.
Yeah, I've encountered similar errors with Postgres as well, but never did any deeper investigation. So maybe it's an issue in crystal-pg.
I'm also finding this to be an issue working with GCP's hosted Postgres-compatible database, where new connections take upwards of 2 seconds to establish, meaning we really need to rely on the pool.

We have the pool configured with `max_idle_pool_size == max_pool_size == 50`, but if nothing is going on for a while the connections eventually time out and broken sockets are returned, causing `DB::ConnectionLost` errors and 500 responses in our web app.
We're also seeing the pool not recovering from DB outages. We work around that with health checks, but it's not optimal.
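For context, a pool configured along those lines might look like this (host and database name are placeholders):

```crystal
require "db"
require "pg"

# Host and database name are placeholders. Keeping max_idle_pool_size
# equal to max_pool_size means the pool never closes idle connections
# on its own, so the ~2 s connect cost is paid rarely.
db = DB.open("postgres://db.internal.example/app?max_idle_pool_size=50&max_pool_size=50")
```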
> As long as you are not checking out connections explicitly, the pool should be able to recover from temporary outages.
From a cursory look at the code, the connection pool does not automatically retry a lost connection. To ensure a lost connection is retried, `#retry(&block)` must be called explicitly:

```crystal
db.retry do
  db.query # ...
end
```
Otherwise, all the `retry_*` options do nothing at all, it seems.

Perhaps `PG::Statement` should inherit from `DB::PoolStatement` instead of `DB::Statement`?
The `db.retry` is not needed. If you do `db.query` (over the `Database`), the retry should happen automatically. See https://crystal-lang.org/reference/1.10/database/connection_pool.html#sample

I haven't validated if something is off with pg in this regard, though.
Oh, I must have missed that, then.
I get `DB::ConnectionLost` regularly with multiple apps in staging. This happens when those apps haven't seen any interaction in a while. I use a serverless database service provider that suspends the server after it's been idle for some time (it starts up again after the first connection attempt).
Here's a trace, if that helps:
```
(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/samba/src/presets/client.cr:7:3 in 'run_operation'
  from /tmp/lucky/src/actions/oauth/token/create.cr:4:3 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/log_handler.cr:29:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???
```
I suspect the app attempts the first connection, fails, and raises the error without a retry. I have tried various connection params to no avail. Currently it's `initial_pool_size=20&max_idle_pool_size=20&max_pool_size=20&retry_attempts=3&retry_delay=1`. Restarting the apps seems to fix it, until it happens again.
They are Lucky apps, and it seems Avram does an explicit checkout here:
I don't see a retry there. Could that be it, or am I missing something?
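For illustration, an explicit checkout of that shape (a hypothetical sketch, not Avram's actual code) pins a single connection and so bypasses the pool's retry:

```crystal
# Hypothetical shape: `conn` is one pinned connection; if it has died,
# the query raises DB::ConnectionLost instead of being retried.
conn = db.checkout
begin
  conn.query_one "SELECT 1", as: Int32
ensure
  conn.release
end
```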
Currently `DB::Database` does not recover from temporary connection failures. It does not seem to evict broken connections from the pool but continues to attempt to use them. This means just a single brief outage will leave the pool in a permanently broken state.

The only way to recover from connection failures appears to be to discard the entire `DB::Database` instance and create a new pool. Is this a bug, or am I using it wrong? 🤔
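For reference, the discard-and-reopen workaround as a minimal sketch (the `ResilientDB` wrapper is illustrative, not part of crystal-db):

```crystal
require "db"
require "pg"

# Illustrative wrapper, not crystal-db API: on DB::ConnectionLost,
# discard the whole DB::Database and open a fresh pool, retrying once.
class ResilientDB
  def initialize(@url : String)
    @db = DB.open(@url)
  end

  def with_db
    yield @db
  rescue DB::ConnectionLost
    @db.close rescue nil
    @db = DB.open(@url)
    yield @db
  end
end

db = ResilientDB.new("postgres://localhost/app")
db.with_db { |d| puts d.scalar("SELECT 1") }
```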
Test-case:
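A minimal sketch of such a test case (the connection URL and retry settings are assumptions):

```crystal
require "db"
require "pg"

# Assumed URL and settings; the behavior should reproduce regardless
# of the retry_* values chosen.
db = DB.open("postgres://localhost/postgres?retry_attempts=3&retry_delay=1")

loop do
  begin
    puts db.scalar("SELECT 1")
  rescue ex : DB::Error
    puts ex
  end
  sleep 1
end
```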
Just leave the above running, and while it runs, restart your local Postgres server.
Regardless of what settings you choose, you will see the following output: