wyyerd / stripe-rs

Rust API bindings for the Stripe HTTP API.
Apache License 2.0
219 stars 88 forks source link

Requests fail intermittently w/ hyper error #173

Closed stearnsc closed 3 years ago

stearnsc commented 3 years ago

On a small portion of requests, we're getting the error: error communicating with stripe: connection closed before message completed and the request is failing.

This appears to be related to https://github.com/hyperium/hyper/issues/2136 - in which hyper attempts to reuse a connection that the server (in this case Stripe) has already closed.

I opened https://github.com/wyyerd/stripe-rs/pull/172 to trial disabling connection pooling to investigate, and after pinning this library to that revision in our production environment, the errors stopped.

In my mind we have 3 viable solutions:

  1. We disable connection pooling entirely, as done in #172
  2. We determine Stripe's keepalive timeout, and set our timeout to less than that in order to prevent the race condition
  3. We implement retry logic of some sort

I'm reasonably concerned about (3) as a solution, because if we incorrectly retry a request that has succeeded without using idempotency keys, we open our users up to double-charging their customers, which is obviously Very Bad™.

I think right now I'm in favor of (1) as a short-term fix, and then possibly moving to (2) if/when we can determine how stripe handles such connections with confidence. Alternatively, we could disable connection pooling by default, but allow it to be enabled through configuration, which puts the burden on our users to configure it correctly, handle retries, etc.

arlyon commented 3 years ago

172 seems like a decent short term solution. I am currently speaking with stripe to see if they can give a proper number (for solution 2) which you could test with; i'll comment when they get back to me.

edit: they're handling the issue internally and will send me an email. even if we decide to go with solution 1, setting a sane timeout is a clear benefit anyways

arlyon commented 3 years ago

Contents of the email, reworded because I'd rather not post verbatim:

Recommended solution from stripe: set the timeout limit to 30 seconds, the same value used in their java library (https://github.com/stripe/stripe-java/blob/4262fcc6e4b0f084c03de438886b2a9f0fbea306/src/main/java/com/stripe/Stripe.java#L9).

It was also mentioned we can (and probably should, eventually) also set up retry logic which would be a nice-to-have, though I agree with your original concerns here. In this case, we can make use of idempotency headers like they do on their node client (https://github.com/stripe/stripe-node/blob/e89cc76ecbf64979cc012d81ab299eee32d8500d/lib/StripeResource.js#L302-L335) which are applied automatically if the caller doesn't supply them.

arlyon commented 3 years ago

Would you be willing to set the timeout to 30 sec and test? I can make some time to implement idempotent retries however that will obviously need a good deal of testing before it can be put in production. Maybe we can use the mock stripe docker image with some load generator to count dropped connections and detect duplicate charges.

stearnsc commented 3 years ago

Sure; I should have time to test that this afternoon.

arlyon commented 3 years ago

Just received a follow up: "while we terminate idle connections after 180s, we may terminate them early for a number of reasons (eg: host management) and you should not explicitly depend on a connection staying open that long". I think retries and idempotency seems like the only bulletproof solution; there is a PR https://github.com/wyyerd/stripe-rs/pull/130 which makes some progress but it is a year old.

stearnsc commented 3 years ago

I'm going to go ahead and merge the "stop connection pooling" change until we implement idempotency. I'll make a followup story for that.

stearnsc commented 3 years ago

Made #176 to track the ongoing issue

stearnsc commented 3 years ago

Closed by #172