wp-media / wp-rocket

Performance optimization plugin for WordPress
https://wp-rocket.me
GNU General Public License v2.0

[R&D] Adaptive batch size on Preload #6396

Closed MathieuLamiot closed 3 months ago

MathieuLamiot commented 9 months ago

Context

It seems Preload can generate a lot of pressure on a server if the pages of the website are slow to open. A way to adapt to this would be to measure how long a request takes and adjust the batch size accordingly.

What to do

This branch is a quick-and-dirty example of how this could be implemented: https://github.com/wp-media/wp-rocket/tree/prototype/preload-adaptative-batch The idea is partially described here, but it has evolved a bit: the batch size is now based on the measured duration of a preload request, obtained by making one request blocking from time to time.

A developer from the plugin team needs to spend some time on this branch to make it production-ready (it may not work as-is; I just wrote the code to lay the idea down) and play with it to see how it behaves, possibly with logs. We have the gamma.rocketlabs.ovh website, which suffers from CPU issues when doing a full cache clear to trigger the preload; it would be a good place to test this. See here.

Warning

This branch would need https://github.com/wp-media/wp-rocket/issues/6394 first. Otherwise, we have no way to prevent flooding the AS queue, and the number of in-progress jobs could increase too quickly.

Khadreal commented 9 months ago

Improved the batch size work a bit from what @MathieuLamiot did: added a transient for all requests and then used its value to determine the max and min size of the next batch of requests.

MathieuLamiot commented 9 months ago

Thanks @Khadreal 🙏 I am not sure what the expected behavior of the rocket_preload_previous_request_durations values would be 🤔 If I understand correctly:

It seems to me that, with the current code, rocket_preload_previous_request_durations starts at 0 and will increase without an upper limit every 5 minutes, so every 5 minutes we will reduce the preload speed. You might be missing a "rolling average" mechanism? Or am I missing something?

A rolling average could be implemented as follows (it's not the best way to do it, but it's the quickest one):

Replace:

$previous_request_durations = $previous_request_durations + $duration;

With:

if ( $previous_request_durations <= 0 ) {
    // First measurement: use it as the initial estimate.
    $previous_request_durations = $duration;
} else {
    // Weighted update: 70% previous estimate, 30% new measurement.
    $previous_request_durations = $previous_request_durations * 0.7 + $duration * 0.3;
}
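
For reference, the snippet above is an exponential moving average: each new measurement contributes 30% to the estimate, so the average reacts quickly to a sustained slowdown while single outliers are dampened.
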
MathieuLamiot commented 9 months ago

I cleaned up a few things and added dedicated logic, based on a transient, to limit the number of blocking requests to 1 per minute. It gives good results on my local setup. I am trying to test on gamma, where we should be able to see the preload going way slower thanks to this; currently blocked because I can't write with the FTP access 🤷

To monitor this easily, I added the following log lines:

After $check_duration = false;:

error_log( sprintf( 'preload_url: duration %s averaged %s', $duration, $previous_request_durations ) );

Before $next_batch_size = min( $next_batch_size, $max_batch_size );:

error_log( sprintf( 'process_pending_jobs: batch size %s averaged %s', $next_batch_size, $preload_request_duration ) );

@piotrbak
In the current branch:

While we finalize testing, we would need your inputs on:

MathieuLamiot commented 8 months ago

After running tests on the gamma website and locally, I adjusted the formula so that we don't impact "normal" websites much but still reduce the batch size when the website is slow (typically, more than 3 or 4 seconds on average per request starts to slow down the preload significantly). I also reduced the timeframe over which the average-duration transient is kept, so that we can adapt quickly when the website's performance changes quickly (which is the case with gamma, for instance).

I opened a PR to keep track, but we'll need AC or at least NRT plans here, and some rework of the unit/integration tests. I manually tested as much as possible and the preloads seem to be going well.

Just one question, as I am not sure about how Preload and RUCSS work together: if preload is slowed down (let's say batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but wanted a confirmation @wp-media/engineering-plugin-team

MathieuLamiot commented 8 months ago

@Khadreal Can you take over this issue and complete it?

Just one question, as I am not sure about how Preload and RUCSS work together: if preload is slowed down (let's say batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but wanted a confirmation @wp-media/engineering-plugin-team

MathieuLamiot commented 7 months ago

Summary of the functional behavior of the implemented solution, as of now

Functional behavior

The number of URLs to preload per batch becomes variable. It is now adjusted based on the time it takes to load a page. This time is estimated by periodically measuring how long a preload request takes and averaging it over time. The impact is that, on websites where loading an uncached page takes more than 2 seconds, the batch size will be reduced and, hence, the preload will take longer.
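
To make that relationship concrete, here is a minimal sketch of how such an adjustment could look. The function name, the exact formula, and the 5/45 bounds are purely illustrative (5 and 45 are the example batch sizes mentioned later in this thread); the actual computation lives in the plugin code.

// Hypothetical helper, not the plugin's actual formula.
function rocket_sketch_adaptive_batch_size( $average_duration, $min_size = 5, $max_size = 45 ) {
    $target_duration = 2.0; // Seconds. Below this, the site is considered fast enough.

    if ( $average_duration <= $target_duration ) {
        return $max_size; // Fast site: keep the regular batch size.
    }

    // Slow site: shrink the batch roughly in proportion to how much the average exceeds the target.
    $scaled = (int) floor( $max_size * $target_duration / $average_duration );

    return max( $min_size, $scaled );
}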

Preparing a batch

When preparing a preload batch, the plugin computes the batch size based on the rocket_preload_previous_request_durations transient (an estimation of how long it takes to load a page). There are safeguards so that the count of pending actions in AS is never above the rocket_preload_cache_pending_jobs_cron_rows_count filter and, if possible, the batch size is at least the rocket_preload_cache_min_in_progress_jobs_count filter. If no estimation is available (first time using this feature, or first preload in at least 5 minutes), the batch size defaults to the minimum.
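
As an illustration of these safeguards (not the plugin's actual code: the local variable names and default filter values are assumptions, and the helper reuses the sketch from the previous section; apply_filters and get_transient are standard WordPress functions):

// Sketch of the batch-preparation safeguards, not the plugin's exact implementation.
$min_batch_size = apply_filters( 'rocket_preload_cache_min_in_progress_jobs_count', 5 );
$max_batch_size = apply_filters( 'rocket_preload_cache_pending_jobs_cron_rows_count', 45 );

$average_duration = (float) get_transient( 'rocket_preload_previous_request_durations' );

if ( $average_duration <= 0 ) {
    // No estimation available (first use, or transient expired): start with the minimum batch size.
    $next_batch_size = $min_batch_size;
} else {
    $next_batch_size = rocket_sketch_adaptive_batch_size( $average_duration, $min_batch_size, $max_batch_size );
}

// Never schedule more than the pending-jobs cap allows.
$next_batch_size = min( $next_batch_size, $max_batch_size );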

Sending preload requests

When sending a preload request, if it has been more than 1 minute since the last estimation, we make the request blocking and measure how long it takes to return. The measured time is used to update the rocket_preload_previous_request_durations transient, with an expiration of 5 minutes. When a new estimation is done, we set the rocket_preload_check_duration transient, with an expiration of 60 seconds. As long as this transient is set, no new estimation will occur.
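
A rough sketch of that flow, assuming $url holds the URL being preloaded; the transient names come from this description, while the request and timing code is an approximation rather than the plugin's exact implementation:

if ( false === get_transient( 'rocket_preload_check_duration' ) ) {
    // More than a minute since the last estimation: make this request blocking and time it.
    $start = microtime( true );

    wp_remote_get( $url, [ 'blocking' => true, 'timeout' => 10 ] );

    $duration = microtime( true ) - $start;

    // Update the rolling average (70% previous estimate, 30% new measurement).
    $previous = (float) get_transient( 'rocket_preload_previous_request_durations' );
    $average  = $previous > 0 ? $previous * 0.7 + $duration * 0.3 : $duration;

    set_transient( 'rocket_preload_previous_request_durations', $average, 5 * MINUTE_IN_SECONDS );

    // Block further estimations for the next 60 seconds.
    set_transient( 'rocket_preload_check_duration', 1, MINUTE_IN_SECONDS );
}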

Controlling the feature

Currently, this feature is enabled by default.

Bypassing the feature

Bypassing the feature means having a constant preload batch size. In the current implementation, to do this, one must set both of these filters to the same value (the desired preload batch size): rocket_preload_cache_min_in_progress_jobs_count and rocket_preload_cache_pending_jobs_cron_rows_count. Note that the estimation of the loading time will still be performed.
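
For example, a site owner could pin both filters from a must-use plugin or the theme's functions.php (45 is only an example value):

// Pin both bounds to the same value to keep a constant preload batch size.
add_filter( 'rocket_preload_cache_min_in_progress_jobs_count', function () {
    return 45;
} );
add_filter( 'rocket_preload_cache_pending_jobs_cron_rows_count', function () {
    return 45;
} );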

List of filters

List of transients

MathieuLamiot commented 4 months ago

@piotrbak Are there changes required to release this compared to the functional description in my last comment above?

piotrbak commented 4 months ago

@MathieuLamiot I think there's nothing to be added here. Just to confirm, if we set rocket_preload_cache_min_in_progress_jobs_count to be equal to the rocket_preload_cache_pending_jobs_cron_rows_count, the preload theoretically won't be slowed down even if the response time is big, right?

MathieuLamiot commented 4 months ago

Just to confirm, if we set rocket_preload_cache_min_in_progress_jobs_count to be equal to the rocket_preload_cache_pending_jobs_cron_rows_count, the preload theoretically won't be slowed down even if the response time is big, right?

Yes, correct.

Are you OK with the following:

piotrbak commented 4 months ago

At first, the batch size will be set to rocket_preload_cache_min_in_progress_jobs_count, so slow at the beginning (conservative approach).

But then, how will it be increased back to the regular size if the request time is less than 2s? In what steps?

@DahmaniAdame pinging you about the 2s loading time: if it's higher, we treat the website as heavily loaded. Also about the 5 as the minimal batch size for loaded websites.

DahmaniAdame commented 4 months ago

A starting batch of 5 is reasonable. It will help low-resource setups avoid getting overloaded after activating preload, and build up if there are enough resources to process more. From what I can see, the 2 seconds is more of a scenario than a cap, and it's likely what most of our users will get. It might be slow, but going beyond it will result in a high load. The setting can still be overruled by the batch size filter, I assume? Just to give a way to bypass it for users who still feel it's not enough and understand the resource impact of not using the automated/recommended preload pace.

MathieuLamiot commented 4 months ago

The setting can still be overruled by the batch size filter, I assume?

See this:

Bypassing the feature

Bypassing the feature means having a constant preload batch size. In the current implementation, to do this, one must set both of these filters to the same value (the desired preload batch size): rocket_preload_cache_min_in_progress_jobs_count and rocket_preload_cache_pending_jobs_cron_rows_count. Note that the estimation of the loading time will still be performed.

One would be able to force a batch size regardless of the reported loading time. We cannot currently change the 2s value if someone wants to "adapt the formula".

MathieuLamiot commented 4 months ago

To whoever picks this up, this needs: