zulip / zulip-mobile

Zulip mobile apps for Android and iOS.
https://zulip.com/apps/
Apache License 2.0
1.3k stars 655 forks source link

Time out sending a message, and indicate failure #3829

Open gnprice opened 4 years ago

gnprice commented 4 years ago

Currently, when we try to send a message and can't, we keep the message around (as an Outbox message) and keep retrying indefinitely until it succeeds.

(Specifically, we do so in a loop with backoff, then at each subsequent message send and at each "initial fetch", so in particular at each launch of the app. See sendOutbox and its call sites.)

This isn't great, because most messages are eventually better to send never than send late. In particular a short conversational reply like "sure" or "no, sorry" may have a completely different meaning if the discussion has moved on.

Instead, when the oldest outbox message reaches a certain age (perhaps 60 seconds? 120 seconds?) and we've been unable to send, we should stop trying to send it; give the user feedback that it failed; and give an option to retry the message or delete it. This is similar to the experience in an SMS client, or in SMS-style chat apps like FB Messenger.

The feedback is important (so that users can be confident that messages are delivered whenever they don't see the feedback), and a bit tricky UI-wise:

gnprice commented 4 years ago

See also #3731 .

rk-for-zulip commented 4 years ago
  • On tap (or long-press), we'd show an action sheet with two options labelled like "Retry" and "Delete".

Also "Edit message", I think.

  • For us: maybe treat the failed-send message as unread?

That would probably be better than nothing, but ideally we should have a distinction (visible and otherwise) between the two. Notably, if the user sends a message and then mutes the relevant topic, they'll probably still want to be informed that the message couldn't be sent.

chrisbobbe commented 4 years ago

This may warrant a separate issue, but I wonder if progressiveTimeout, which is used here, a) might be improved with the option to make the returned promise reject after a certain number of tries (or on a duration-based condition), and b) is doing what it's currently intended to do.

a) is more related to this issue. We could handle message sending failures locally, by breaking out of the loop and not calling progressiveTimeout after a certain condition. Or, we could build into progressiveTimeout an option to reject after a threshold is reached (maybe based on the number of tries, or most recent or cumulative duration), which makes sense conceptually as something we may want to do at other sites. Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet. Thinking about the global nature of progressiveTimeout becomes more urgent if we decide to reject the promise, say, when the most recent timeout from any call to progressiveTimeout was 60 seconds (or at some other threshold).

For b), it appears that progressiveTimeout's current intention is, when run in a loop, to do an exponential backoff up to and including a delay that exceeds 60s, and then reset to zero and repeat. This resetting is explained at the end of the JSDoc, but there may be an argument for making that note more prominent, as I didn't assume it would reset to zero:

/**
 * Sleep for a timeout that progressively grows in duration.
 *
 * Starts at 0 on the first call.  Grows initially by a small increment,
 * then exponentially, up to a maximum.
 *
 * If the last call was over 60s ago, starts over at 0.
 */

In any case, from reading the code and as confirmed empirically, this resetting never actually happens. The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer. (EDIT: N.B., since it's global, as mentioned earlier, this looks to be a 10-second timer on all requests that use progressiveTimeout, every time, after that initial backoff.)

chrisbobbe commented 4 years ago

However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Hang on. Even if request A succeeds after several increasing progressiveTimeouts, request B's first timeout will not be zero, right — it'll be longer than A's last one!

gnprice commented 4 years ago

The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer.

The intention is that it backs off up to a maximum interval (10s), then stays there as long as we're in a continuous retry loop; and then in the future, after that retry loop is over and things might be all working again, starts the backoff sequence over from zero.

The 60s thing isn't super principled, but I think it can basically be seen as an attempt to implement "that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best."

Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Yeah, I agree. I think it'd be better for it to track its state for one retry loop at a time, rather than globally.

A good way to structure that might be to stick the state in a closure, something like

export const makeBackoffMachine = () => {
  let lastTimeoutTime = 0;
  let lastDuration = 0;

  return {
    wait(): Promise<void> {
      // ... the body of the existing `progressiveTimeout` ...
    }
  };
};

and then callers would look like

const backoffMachine = makeBackoffMachine();
while (true) {
  // ...

  await backoffMachine.wait();
}

Hmm, yeah, and this should probably be a separate issue thread. :wink:

chrisbobbe commented 4 years ago

OK, thanks for the clarification and suggestion! I'm working on an implementation of this; I'll post in a new issue soon. I think I understand the 60s much better now. I believe, with the suggested implementation, that "a new retry loop" in

"that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best"

will be determined much more easily and accurately (i.e., when a new backoff machine has been made), so the 60s heuristic won't actually be needed, and it'll all be clearer and more deterministic without it.

Opened as https://github.com/zulip/zulip-mobile/issues/3840.

chrisbobbe commented 4 years ago

3841 has been merged, clearing the way for work on this issue. Anyone interested in claiming this?

rk-for-zulip commented 4 years ago

Let's say me. (I'm already working on #3881, which has a fair range of intersection, and I think if someone else claimed it we'd be stepping on each other's toes a lot.)