I hit this issue while running a heavy load against Spanner, with multiple tasks sharing the same client. The root cause is a race between the acquire timeout and the handoff of a session to a waiter through a oneshot.
In acquire, a oneshot is created and added to the waiters, and acquire then receives from that oneshot under a timeout. When a session becomes free it is sent through the oneshot, and acquire gets the session.
The problem is a race with the timeout. The timeout can complete at the same instant that a session becomes free and is placed in the oneshot. Since the timeout has already completed, nothing ever receives the session, so it is leaked. Most importantly, in_use is never updated, so the session still appears to be in use even though it has been dropped. When this happens enough times we leak every session, and no new ones are created because we have hit max open.
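To make the leak concrete, here is a minimal sketch of the sequence described above. It is not the pool's actual code: it assumes std::sync::mpsc as a stand-in for the async oneshot, and Session, leak_demo and the in_use counter are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::time::Duration;

/// Hypothetical stand-in for a pooled session.
struct Session;

/// Replays the race in a fixed order and returns the in_use count
/// that is left behind once the waiter has gone away.
fn leak_demo() -> usize {
    // One session is currently checked out by some other task.
    let in_use = AtomicUsize::new(1);

    // The waiter's channel. The pool would keep `tx` in its waiter queue.
    let (tx, rx) = mpsc::channel::<Session>();

    // Step 1: the waiter's timeout fires before anything was sent.
    let timed_out = rx.recv_timeout(Duration::from_millis(10)).is_err();
    assert!(timed_out);

    // Step 2: the other task releases its session. The receiver has not
    // been dropped yet, so the send succeeds and the release path assumes
    // the waiter now owns the session (assumed behavior: in_use is left
    // unchanged on a successful handoff).
    let handed_over = tx.send(Session).is_ok();
    if !handed_over {
        in_use.fetch_sub(1, Ordering::SeqCst);
    }

    // Step 3: the waiter returns its timeout error and drops the receiver.
    // The session inside the channel is dropped with it, unseen.
    drop(rx);

    // Nobody holds a session any more, but the counter still says one is in use.
    in_use.load(Ordering::SeqCst)
}
```

Calling leak_demo() returns 1: the counter still reports one session in use although no task holds one.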
The solution is to not send the session through the oneshot. Instead, the oneshot only notifies the waiter that a session is available in available_sessions, and the waiter then tries to take one from there. This is wrapped in a loop, because another acquire call may take the session first.
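The shape of the fix can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions, not the code in this change: it uses std threads primitives (Mutex and std::sync::mpsc) instead of the async oneshot and timeout, and Pool, Inner, Session and the method names are hypothetical. The last check after a timeout is an extra precaution in this sketch rather than something described above.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{self, Sender};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a pooled session.
#[derive(Debug)]
struct Session(u32);

#[derive(Default)]
struct Inner {
    available_sessions: VecDeque<Session>,
    // Waiters only ever receive a unit notification, never a session.
    waiters: VecDeque<Sender<()>>,
    in_use: usize,
}

struct Pool {
    inner: Mutex<Inner>,
}

impl Pool {
    fn new(sessions: Vec<Session>) -> Self {
        let inner = Inner { available_sessions: sessions.into(), ..Default::default() };
        Pool { inner: Mutex::new(inner) }
    }

    fn in_use(&self) -> usize {
        self.inner.lock().unwrap().in_use
    }

    /// Takes a session from available_sessions, waiting up to `timeout`.
    fn acquire(&self, timeout: Duration) -> Option<Session> {
        let deadline = Instant::now() + timeout;
        loop {
            let rx = {
                let mut inner = self.inner.lock().unwrap();
                if let Some(session) = inner.available_sessions.pop_front() {
                    inner.in_use += 1;
                    return Some(session);
                }
                // Nothing free: register as a waiter and release the lock.
                let (tx, rx) = mpsc::channel();
                inner.waiters.push_back(tx);
                rx
            };
            let remaining = deadline.saturating_duration_since(Instant::now());
            if rx.recv_timeout(remaining).is_err() {
                // Timed out. A notification may have raced with the timeout,
                // so look once more before giving up. Either way the session
                // stays in available_sessions and cannot be lost.
                let mut inner = self.inner.lock().unwrap();
                let session = inner.available_sessions.pop_front();
                if session.is_some() {
                    inner.in_use += 1;
                }
                return session;
            }
            // Notified: loop and try to take a session. Another acquire call
            // may have taken it first, in which case we wait again.
        }
    }

    /// Returns a session to the pool and wakes one waiter that is still listening.
    fn release(&self, session: Session) {
        let mut inner = self.inner.lock().unwrap();
        inner.in_use = inner.in_use.saturating_sub(1);
        inner.available_sessions.push_back(session);
        while let Some(waiter) = inner.waiters.pop_front() {
            // send fails if that waiter already gave up and dropped its receiver.
            if waiter.send(()).is_ok() {
                break;
            }
        }
    }
}
```

Because release always updates in_use and puts the session back in available_sessions before notifying anyone, a waiter that times out at the wrong moment loses only a wake-up, not a session.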