zulip / zulip-mobile

Zulip mobile apps for Android and iOS.
https://zulip.com/apps/
Apache License 2.0
1.28k stars 643 forks source link

Sending outbox messages is fraught with issues #3881

Open rk-for-zulip opened 4 years ago

rk-for-zulip commented 4 years ago

There is a small constellation of intertwined problems with message-sending.

gnprice commented 4 years ago
  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

All of these would be good to address. I imagine a single solution may address several of them at once.

gnprice commented 4 years ago

See also #3829, #3731, #3584, #2374. And #3247 is an issue with a similar nature on a different codepath: double-sends of posting an emoji reaction to a message.

gnprice commented 4 years ago

OK, and in particular:

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

This one is #2374. There's some nice discussion on that thread, too.

  • Server-rejected messages are not marked as such and will be repeatedly re-sent, wasting bandwidth and battery.

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

rk-for-zulip commented 4 years ago
  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

On reread, I think that's actually not possible at the moment... but only thanks to other bugs. tryUntilSuccessful doesn't await (it's not even async), and since it always returns true, sendOutbox will never await either. I don't think it's currently possible for the dehydration code to run while outboxSending is true.

At least, not unless there's an exception-generating bug in tryUntilSuccessful somewhere. That could do it.

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

There is, sadly. The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

This can all be fixed on our end, at least in theory. The scenario you described is another matter, and may require API changes to prevent. (Though EVENT_NEW_MESSAGE should provide some mitigation.)

All of these would be good to address. I imagine a single solution may address several of them at once.

I would be very wary of any solution that did not.

rk-for-zulip commented 4 years ago

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

They're certainly closely related, but only because they'll need essentially the same data to be stored in order to fix. (See below.) It's otherwise technically possible to fix either without the other – and in particular I wouldn't want the fix for the network behavior to be blocked on having an appropriate display form.

It's probably appropriate to copy my comment there over here, though:

From offline brainstorming: perhaps there should be an optional httpResponse field in Outbox. [...]

We could also give httpResponse a special non-numeric "too old to send" value [...] and retain messages that would have been dropped due to age [...].

gnprice commented 4 years ago

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

Aha. There's actually a good reason this can't happen. That flag specifically appears at state.session.outboxSending... and state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

Quoting from src/boot/store.js:

/**
 * Properties on the global store which we explicitly choose not to persist.
 *
 * All properties on the global store should appear either here or in the
 * lists of properties we do persist, below.
 */
// prettier-ignore
export const discardKeys: Array<$Keys<GlobalState>> = [
  'alertWords', 'caughtUp', 'fetching',
  'nav', 'presence', 'session', 'topics', 'typing', 'userStatus',
];

That makes state.session an appropriate spot for information, like this, that's about what the current live app process is actively doing.

gnprice commented 4 years ago

The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

I see. I think you're referring to the fact that this await:

export const trySendMessages = (dispatch: Dispatch, getState: GetState): boolean => {
  // ...
  try {
    outboxToSend.forEach(async item => {
      // ...
      await api.sendMessage(auth, {

is in a forEach callback, and so nothing actually awaits the promise that that async function returns. And as a result here:

export const sendOutbox = () => async (dispatch: Dispatch, getState: GetState) => {
  // ...
  dispatch(toggleOutboxSending(true));
  while (!trySendMessages(dispatch, getState)) {
    await progressiveTimeout(); // eslint-disable-line no-await-in-loop
  }
  dispatch(toggleOutboxSending(false));
};

we set outboxSending back to false without actually waiting for any of those api.sendMessage promises to complete.

One nuance:

the message-sending tasks will all be fired twice before any send-attempts are made.

I don't think this is quite right. They'll all be fired twice before any responses to the send-attempts are received. But remember that await foo() is await (foo()), i.e. the foo() function call happens before any awaiting is needed... and that when a JS async function is called, the function body starts executing immediately, just like a non-async function, and yields only at await or return.

So it should be the case that when you call api.sendMessage(..), it synchronously gets all the way to a fetch call. (If not, that's probably a bug in our code, and kind of a surprising one.) And that, similarly, will fire off an actual network request -- then return a promise which will get resolved (or rejected) when the request concludes, and ultimately api.sendMessage(..) returns some promise chained off of that one.

Still, firing twice before seeing any responses is clearly not the desired behavior here either.

gnprice commented 4 years ago

Looking again at this code, I think the basic problem is that that outboxToSend.forEach loop is a forEach loop, and the code (as quoted in my previous comment) was clearly written with the expectation it would behave like a for loop.

In particular the boolean return from trySendMessages is almost completely useless given the way the code actually works -- the only way to make sense of the intention here is that it wires up a failed request in api.sendMessage to cause us to do an await progressiveTimeout(). An exception in the top half of that forEach callback, before reaching the await, (a) is a lot less likely, (b) would be a bug in our code, not an external condition like a bad network connection, (c) would therefore make no sense to retry and less sense to back off before retrying.

But also, if that loop is changed to be a real JS loop so that the awaits actually happen in sequence, then I believe that fixes three of the points mentioned in the OP:

  • All outbox messages are sent immediately, with no throttling.
  • Outbox messages may be sent in any order, regardless of their input order.
  • Outbox messages may be successfully sent multiple times.

(For the third, it fixes the cause of it you had in mind, though not the other cause I then mentioned above.)

So, seems like we should do that!

I also went and took a look at the history of this file, following my speculations above about what the authors were thinking. :wink: The logic became the way it is in #3272 -- which in particular was supposed to fix #3259, sending messages out of order. Before that, #1079 introduced the forEach(async ..) confusion.

rk-for-zulip commented 4 years ago

So, seems like we should do that!

Except that if we do – or at least, if we just do that – then that causes another bug to surface: the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week, after which all sendable messages (up to the next unsendable message, anyway) will suddenly be flushed to the server at once.

(This would have been worse before the recent 'discard unsendable messages after a week' change. Back then it would just have silently prevented any further messages from ever being sent at all.)

gnprice commented 4 years ago

Aha. There's actually a good reason this can't happen. [...] state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

This is a pretty subtle aspect of how the app works. I've just pushed 664ee092e, which adds a bit more discussion of this in a place which hopefully helps make it somewhat easier to discover.

gnprice commented 4 years ago

the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week

Yeah, we should behave differently on

In the latter case, we should retry with backoff, and keep other outbox messages in the queue behind this one, along the lines of what this code was intended to do.

In the former, we should treat the message as permanently unsendable. One more-immediate thing we could do, to unblock fixing the rest of the logic without regressing anything, would be: on a 4xx status, move on to the next outbox message in the queue, but leave this one there, basically like we (accidentally) do in the current code.

chrisbobbe commented 3 years ago

There's been quite a bit of productive discussion around here. The main piece of prep work we'd like to get done for our solution is #4193, which introduces a handy promiseTimeout function. We can use that function to time out a send-message request.

Marking as blocked by #4193, then, as noted in chat.

gnprice commented 2 years ago

Marking as blocked by #4193, then

That PR was merged (after being split in two: #4753, #4754), so unblocking.

gnprice commented 2 years ago

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design. (@chrisbobbe sent a design up at the top of the thread, and then there was further discussion but nobody has compiled a revised draft all in one place.)

That will include a state diagram, akin to this one, and a type definition, like this one crossed with the full draft from the start of the thread.

chrisbobbe commented 2 years ago

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design.

I've picked this back up again, here. I've written a new state diagram but I'd like to go through some concerns with it before doing the other parts.

chrisbobbe commented 2 years ago

Here's our latest state diagram, from discussion:

            User cancels the scheduled send.
           ┌────────────────────────────────────────────────────────────────────────┐
           │                                                                        │
           │                                                                        │
           │                                   User cancels during send (#4170).    │
           │                                  ┌───────────────────────────────────┐ │
           │                                  │                                   │ │
           │                                  │                                   │ │
           │                                  │                Event received,    │ │
(create)   │      Time for the scheduled      │                or we abandoned    │ │
  │        │      send (a try/retry).         │    200.        the queue.         ▼ ▼
  └► should-send ───────────────────────► sending ─────► sent ────────────────► (delete)
       │ ▲ ▲                                │ │                                     ▲
       │ │ │                                │ │                                     │
       │ │ │ App quit: schedule auto-retry. │ │                                     │
       │ │ │                                │ │                                     │
       │ │ │ 5xx, network error, or (with   │ │                                     │
       │ │ │ #4170) 60s network timeout:    │ │                                     │
       │ │ │ schedule auto-retry with       │ │                                     │
       │ │ │ backoff.                       │ │ 4xx.                                │
       │ │ └────────────────────────────────┘ └───────────────────────┐             │
       │ │                                                            │             │
       │ │                                                            │    User     │
       │ │ User requested a retry; schedule to run immediately.       ▼    cancels. │
       │ └──────────────────────────────────────────────────────── failed ──────────┘
       │                                                              ▲
       │                                                              │
       │ Too old: message is "better never than late". Too much time  │
       │ from creation or last user-requested retry.                  │
       └──────────────────────────────────────────────────────────────┘
sevmonster commented 1 year ago

I just had a situation where a user sending 4 messages turned into 20 due to a connectivity issue with the server... I assume that #5525 is also related to this.