zulip / zulip-flutter

Upcoming Zulip mobile apps for Android and iOS, using Flutter
Apache License 2.0
202 stars 195 forks source link

Retry storm on fetchOlder in message list #945

Open gnprice opened 2 months ago

gnprice commented 2 months ago

@alexmv reported the app on his device was sending a rapid series of getMessages requests to the server — around 6 such requests a second.

Because the server rate-limits requests (by default to 200 per minute for each user, total across all types of requests), the user-facing effect was that the requests all failed, with the symptom that scrolling up further in the message list didn't work. The requests also put load on the server, though mitigated by the rate-limiting.

We should fix this.

Diagnosis

From looking at the code, what's happening is:

Implementation

To fix the issue, we'll add backoff to MessageListView.fetchOlder, which is the method that invokes the API request.

When fetchOlder gets an error from the server, it should start a backoff timer (using our BackoffMachine so the intervals escalate) and ignore any further calls until the backoff expires, just like it ignores them when fetchingOlder is true.

We won't add any new retry logic. The fetchOlder call site, in _MessageListState._handleScrollMetrics, already provides plenty of retrying — all the user has to do is touch the screen to scroll slightly up or down, and fetchOlder will get called again. As Alex saw, it can also get called even with no interaction when nothing else seems to be happening.

Out of scope

A further bonus refinement would be for us to understand 429 responses, and wait for the time that that response says:

But to really do that right probably belongs at a different layer, in ApiConnection, so that the information is shared across all the different requests we make on behalf of a given account.

And conversely I think if we fix the rapid retries, that will largely solve the problem. I suspect the reason we ended up with 429s here in the first place was likely because of too-rapid retries after some other source of error. Even if not, exponential backoff should let us get out of the situation.