ysbaddaden / pool

Generic (connection) pools for Crystal

Cleanup failed instances (connections) #2

Open ysbaddaden opened 8 years ago

ysbaddaden commented 8 years ago

The pool has no way to know that a connection failed and must be either reset or simply removed from the pool. Still, we could at least provide a means to remove a connection from the pool, so we can clean up failed connections and let the pool create new, valid ones.
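A minimal sketch of what such a removal API could look like; note that `Pool#delete` does not exist yet (it is the proposal), and `ping` stands in for whatever health check your connection type supports:

```crystal
# Illustrative only: `pool` is a pool instance, `Pool#delete` is the
# proposed (not yet existing) removal method, `ping` is a placeholder.
conn = pool.checkout
begin
  conn.ping
rescue IO::Error
  # Eject the broken connection instead of checking it back in,
  # so the pool can create a fresh, valid connection on the next checkout.
  pool.delete(conn)
end
```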

ray-delossantos commented 8 years ago

Hello @ysbaddaden, any idea on how to recycle the connections every n minutes, or when there's an IO::Timeout exception?

ysbaddaden commented 8 years ago

You may have a coroutine that sleeps for n minutes then resets connections, or you may wrap your code and catch IO::Timeout exceptions to reset connections.
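Both suggestions could be sketched roughly like this in Crystal, assuming a `pool` instance with a hypothetical `reset_connections!` method (not part of the shard's API):

```crystal
# 1) A background fiber ("coroutine") that recycles connections every n minutes:
spawn do
  loop do
    sleep 5.minutes
    pool.reset_connections! # hypothetical: close and replace stale connections
  end
end

# 2) Wrap pool usage and reset when a timeout is raised:
begin
  pool.connection do |conn|
    conn.do_work # placeholder for your actual calls
  end
rescue IO::Timeout # renamed IO::TimeoutError in recent Crystal versions
  pool.reset_connections!
end
```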

ray-delossantos commented 8 years ago

Thanks @ysbaddaden.

asterite commented 5 years ago

@ysbaddaden Could you provide some code that triggers this behavior? I'd like to try to fix it.

ysbaddaden commented 5 years ago

It's theoretical, and maybe that's not a problem. The ConnectionPool Ruby gem states:

> There is no provision for repairing or checking the health of a connection; connections should be self-repairing.

I think it's fine. If I have a pool of Redis clients, each Redis client should detect that a connection has failed and should automatically reconnect.
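As an illustration of what "self-repairing" means here, a minimal sketch of a client that reconnects on its own; the class, the one-retry policy, and the wire protocol are made up for illustration, not the actual crystal-redis implementation:

```crystal
require "socket"

# Hypothetical self-repairing client: when a command fails because the
# underlying socket died, reconnect once and retry transparently.
class ReconnectingClient
  def initialize(@host : String, @port : Int32)
    @socket = TCPSocket.new(@host, @port)
  end

  def command(cmd : String) : String
    send_command(cmd)
  rescue IO::Error
    @socket = TCPSocket.new(@host, @port) # reconnect, then retry once
    send_command(cmd)
  end

  private def send_command(cmd : String) : String
    @socket << cmd << "\r\n"
    @socket.gets || raise IO::Error.new("connection closed")
  end
end
```

With clients like this in the pool, the pool itself never needs to know a connection went bad.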

asterite commented 5 years ago

Oh, I see what you mean. Yes, each connection should try to reconnect itself.

ysbaddaden commented 5 years ago

I thought this TODO was somewhat related:

> TODO: reap connection of dead coroutines that didn't checkin (or died before)

ConnectionPool(T) attaches a connection from the pool to the current Fiber, so a fiber could terminate before returning its connection to the pool. There should be a reaper fiber to release the connections of dead fibers, but we currently don't monitor fiber state.

asterite commented 5 years ago

I'm not sure that's a responsibility of the pool. If, say, an HTTP::Server handler requests a connection from the pool, I would expect an exception handler around the entire handler block that ensures the connection is returned afterwards. Then, if a fiber dies (unhandled exception), the ensure will be executed before that. Right?

ysbaddaden commented 5 years ago

Yes, and using ConnectionPool#connection(&block) will release it automatically, too. But Rails' ActiveRecord has a reaper that will release connections held by crashed threads, for example, because it's better to be safe than sorry.
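For reference, a sketch of that block-scoped pattern; the method names approximate the shard's API (verify against the source), and `query` is a placeholder:

```crystal
# Block form: the connection is checked back in when the block exits,
# even if an exception is raised inside it.
pool.connection do |conn|
  conn.query("SELECT 1") # placeholder method
end

# Roughly equivalent explicit form:
conn = pool.checkout
begin
  conn.query("SELECT 1")
ensure
  pool.checkin # return the current fiber's connection to the pool
end
```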

asterite commented 5 years ago

> because it's better to be safe than sorry

Makes sense. But how can we do it in Crystal? We can have a periodic check... but what do we check? We can't know whether a Fiber is dead or alive just by knowing its object id.

mperham commented 5 years ago

> But Rails' ActiveRecord has a reaper that will release connections held by crashed threads, for example, because it's better to be safe than sorry.

AFAIK Rails only has this for legacy reasons: the AR API can be used outside of a block (e.g. User.find(1)) so there's no explicit checkin of any used connections. If your API is properly block-scoped with ensure, you don't need background cleanup, e.g.:

User.query do |u|
  u.find(1)
end

jwoertink commented 2 years ago

Just ran into this issue with our production app. We're using https://github.com/stefanwille/crystal-redis, which pulls in this shard. Our Redis instance seemed to just die randomly, and none of the connections would auto-reconnect once it came back up. We had to reboot the entire app.

I know @robcole was running into a similar issue with crystal-db and was looking into a reaper-type functionality: https://github.com/the-business-factory/crystal-db/pull/1. I wonder if that would be a good way to go here?

robcole commented 2 years ago

To elaborate a bit on @jwoertink's post, the issue I was running into was that socket connections (both TCP and Unix) would be disconnected by a firewall, Docker, or system settings on various hosts (fly.io and render.com were the two I tested; fly.io uses TCP sockets and Render uses Unix ones). Because those disconnects were server-side, the client wasn't aware that the connections in the pool had been dropped, so they were still treated as open sockets/connections.

Most pools in various languages handle this by rescuing the failed connection, ejecting it from the pool, and retrying -- this adds latency spikes, and in my apps it was actually triggering DB::ConnectionLost errors (probably because retry_attempts + retry_delay couldn't get a successful retry in before timing out, but I'm not entirely certain).
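The rescue-eject-retry pattern described above could be sketched like this; `checkout`/`checkin` approximate the pool API, while `delete` and `query` are hypothetical:

```crystal
# Sketch of rescue-eject-retry: try up to `attempts` connections,
# ejecting each one that turns out to be a dead socket.
def with_retry(pool, attempts = 3)
  attempts.times do
    conn = pool.checkout
    begin
      result = yield conn
      pool.checkin # success: return the connection to the pool
      return result
    rescue IO::Error
      pool.delete(conn) # hypothetical: eject the dead connection
    end
  end
  raise "no healthy connection after #{attempts} attempts"
end

# usage:
# with_retry(pool) { |conn| conn.query("SELECT 1") }
```

Each failed attempt costs a round trip, which is where the latency spikes come from.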

Golang and Java both have ConnectionPools that introduced this concept -- I picked up on it from discussion in https://github.com/crystal-lang/crystal-db/pull/109.

I've been running the pool for a few days in prod against a Lucky+Avram+PG app and it's working great -- 0 errors and a significantly decreased p95 time (since pools are almost always fully available).

@jwoertink - Happy to help you implement a version of this in this pool if you'd like to lead a pair on it at some point -- it's pretty hard to manually create events that lead to these disconnects in PG, so testing can be a bit difficult, but otherwise I feel pretty solid about the approach, as it's fairly simple.

jwoertink commented 2 years ago

We're testing a few things, but it's fairly easy for us to check this: we boot the app, then restart Redis and see Socket::ConnectError.