wp-media / wp-rocket

Performance optimization plugin for WordPress
https://wp-rocket.me
GNU General Public License v2.0

[R&D] Adaptive batch size on Preload #6396

Closed MathieuLamiot closed 3 months ago

MathieuLamiot commented 9 months ago

Context

It seems Preload can generate a lot of pressure on a server if the pages of the website are slow to open. A way to adapt to this would be to measure how long a request takes and adjust the batch size accordingly.

What to do

This branch is a quick-and-dirty example of how this could be implemented: https://github.com/wp-media/wp-rocket/tree/prototype/preload-adaptative-batch The idea is partially described here, but it has evolved a bit: the batch size is now based on the measured duration of a preload request, obtained by making one request blocking from time to time.

A developer from the plugin team needs to spend some time on this branch to make it production-ready (it may not work as-is; I just wrote the code to lay the idea down) and play with it to see how it behaves, possibly with logs. We have the gamma.rocketlabs.ovh website, which suffers from CPU issues when doing a full cache clear to trigger the preload; it would be a good place to test this. See here.

Warning

This branch would need https://github.com/wp-media/wp-rocket/issues/6394 first. Otherwise, we have no way to prevent flooding the AS queue, and the number of in-progress jobs could increase too quickly.

Khadreal commented 9 months ago

Improved the batch size work a bit from what @MathieuLamiot did: added a transient for all requests and then used its value to determine the max and min size of the next batch of requests.

MathieuLamiot commented 9 months ago

Thanks @Khadreal 🙏 I am not sure what the expected behavior of the rocket_preload_previous_request_durations values would be 🤔 If I understand correctly:

It seems to me that, with the current code, rocket_preload_previous_request_durations starts at 0 and will increase without an upper limit every 5 minutes, so every 5 minutes we will reduce the preload speed. You might be missing a "rolling average" mechanism? Or am I missing something?

A rolling average could be implemented as follows (it's not the best way to do it, but it's the quickest one):

Replace:

$previous_request_durations = $previous_request_durations + $duration;

With:

if ( $previous_request_durations <= 0 ) {
    // First measurement: use it as the initial estimate.
    $previous_request_durations = $duration;
} else {
    // Weighted update: 70% previous estimate, 30% new measurement.
    $previous_request_durations = $previous_request_durations * 0.7 + $duration * 0.3;
}
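
For reference, the snippet above is an exponential moving average: each new measurement contributes 30% to the estimate, so the average reacts quickly to a sustained slowdown while single outliers are dampened.
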
MathieuLamiot commented 9 months ago

I cleaned up a few things and added dedicated logic, based on a transient, to limit the number of blocking requests to 1 per minute. It gives good results on my local setup. I am trying to test on gamma, where we should be able to see the preload going way slower thanks to this; currently blocked because I can't write with the FTP access 🤷

To monitor this easily, I added the following log lines:

After $check_duration = false;:

error_log( sprintf( 'preload_url: duration %s averaged %s', $duration, $previous_request_durations ) );

Before $next_batch_size = min( $next_batch_size, $max_batch_size );:

error_log( sprintf( 'process_pending_jobs: batch size %s averaged %s', $next_batch_size, $preload_request_duration ) );

@piotrbak
In the current branch:

While we finalize testing, we would need your inputs on:

MathieuLamiot commented 8 months ago

After running tests on the gamma website and locally, I adjusted the formula so that we don't impact "normal" websites much but still reduce the batch size when the website is slow (typically, more than 3 or 4 seconds on average per request starts to slow down the preload significantly). I also reduced the timeframe over which the average-duration transient is kept, so that we can adapt quickly when the website's performance changes quickly (which is the case with gamma, for instance).

I opened a PR to keep track, but we'll need AC or at least NRT plans here, and some rework of the unit/integration tests. I manually tested as much as possible and the preloads seem to be going well.

Just one question, as I am not sure about how Preload and RUCSS work together: if preload is slowed down (let's say batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but wanted a confirmation @wp-media/engineering-plugin-team

MathieuLamiot commented 8 months ago

@Khadreal Can you take over this issue and complete it?

Just one question, as I am not sure about how Preload and RUCSS work together: if preload is slowed down (let's say batch size is 5 instead of 45), does it have any impact on the rate at which we'll add RUCSS jobs to the table and send them? I don't think so, but wanted a confirmation @wp-media/engineering-plugin-team

MathieuLamiot commented 7 months ago

Summary of the functional behavior of the implemented solution, as of now

Functional behavior

The number of URLs to preload per batch becomes variable. It is now adjusted based on the time it takes to load a page. This time is estimated by periodically measuring how long a preload request takes and averaging it over time. The impact is that, on websites where loading an uncached page takes more than 2 seconds, the batch size will be reduced and, hence, the preload will take longer.
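
To make that relationship concrete, here is a minimal sketch of how such an adjustment could look. The function name, the exact formula, and the 5/45 bounds are purely illustrative (5 and 45 are the example batch sizes mentioned later in this thread); the actual computation lives in the plugin code.

// Hypothetical helper, not the plugin's actual formula.
function rocket_sketch_adaptive_batch_size( $average_duration, $min_size = 5, $max_size = 45 ) {
    $target_duration = 2.0; // Seconds. Below this, the site is considered fast enough.

    if ( $average_duration <= $target_duration ) {
        return $max_size; // Fast site: keep the regular batch size.
    }

    // Slow site: shrink the batch roughly in proportion to how much the average exceeds the target.
    $scaled = (int) floor( $max_size * $target_duration / $average_duration );

    return max( $min_size, $scaled );
}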

Preparing a batch

When preparing a preload batch, the plugin computes the batch size based on the rocket_preload_previous_request_durations transient (an estimation of how long it takes to load a page). There are safeguards so that the count of pending actions in AS is never above the rocket_preload_cache_pending_jobs_cron_rows_count filter and, if possible, the batch size is at least the rocket_preload_cache_min_in_progress_jobs_count filter. If no estimation is available (first time using this feature, or first preload in at least 5 minutes), the batch size defaults to the minimum.
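
As an illustration of these safeguards (not the plugin's actual code: the local variable names and default filter values are assumptions, and the helper reuses the sketch from the previous section; apply_filters and get_transient are standard WordPress functions):

// Sketch of the batch-preparation safeguards, not the plugin's exact implementation.
$min_batch_size = apply_filters( 'rocket_preload_cache_min_in_progress_jobs_count', 5 );
$max_batch_size = apply_filters( 'rocket_preload_cache_pending_jobs_cron_rows_count', 45 );

$average_duration = (float) get_transient( 'rocket_preload_previous_request_durations' );

if ( $average_duration <= 0 ) {
    // No estimation available (first use, or transient expired): start with the minimum batch size.
    $next_batch_size = $min_batch_size;
} else {
    $next_batch_size = rocket_sketch_adaptive_batch_size( $average_duration, $min_batch_size, $max_batch_size );
}

// Never schedule more than the pending-jobs cap allows.
$next_batch_size = min( $next_batch_size, $max_batch_size );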

Sending preload requests

When sending a preload request, if it has been more than 1 minute since the last estimation, we make the request blocking and measure how long it takes to return. The measured time is used to update the rocket_preload_previous_request_durations transient, with an expiration of 5 minutes. When a new estimation is done, we set the rocket_preload_check_duration transient, with an expiration of 60 seconds. As long as this transient is set, no new estimation will occur.
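
A rough sketch of that flow, assuming $url holds the URL being preloaded; the transient names come from this description, while the request and timing code is an approximation rather than the plugin's exact implementation:

if ( false === get_transient( 'rocket_preload_check_duration' ) ) {
    // More than a minute since the last estimation: make this request blocking and time it.
    $start = microtime( true );

    wp_remote_get( $url, [ 'blocking' => true, 'timeout' => 10 ] );

    $duration = microtime( true ) - $start;

    // Update the rolling average (70% previous estimate, 30% new measurement).
    $previous = (float) get_transient( 'rocket_preload_previous_request_durations' );
    $average  = $previous > 0 ? $previous * 0.7 + $duration * 0.3 : $duration;

    set_transient( 'rocket_preload_previous_request_durations', $average, 5 * MINUTE_IN_SECONDS );

    // Block further estimations for the next 60 seconds.
    set_transient( 'rocket_preload_check_duration', 1, MINUTE_IN_SECONDS );
}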

Controlling the feature

Currently, this feature is enabled by default.

Bypassing the feature

Bypassing the feature means having a constant preload batch size. In the current implementation, to do this, one must set both of these filters to the same value (the desired preload batch size): rocket_preload_cache_min_in_progress_jobs_count and rocket_preload_cache_pending_jobs_cron_rows_count. Note that the estimation of the loading time will still be performed.
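
For example, a site owner could pin both filters from a must-use plugin or the theme's functions.php (45 is only an example value):

// Pin both bounds to the same value to keep a constant preload batch size.
add_filter( 'rocket_preload_cache_min_in_progress_jobs_count', function () {
    return 45;
} );
add_filter( 'rocket_preload_cache_pending_jobs_cron_rows_count', function () {
    return 45;
} );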

List of filters

List of transients

MathieuLamiot commented 4 months ago

@piotrbak Are there changes required to release this compared to the functional description in my last comment above?

piotrbak commented 4 months ago

@MathieuLamiot I think there's nothing to be added here. Just to confirm, if we set rocket_preload_cache_min_in_progress_jobs_count to be equal to the rocket_preload_cache_pending_jobs_cron_rows_count, the preload theoretically won't be slowed down even if the response time is big, right?

MathieuLamiot commented 4 months ago

Just to confirm, if we set rocket_preload_cache_min_in_progress_jobs_count to be equal to the rocket_preload_cache_pending_jobs_cron_rows_count, the preload theoretically won't be slowed down even if the response time is big, right?

Yes, correct.

Are you OK with the following:

piotrbak commented 4 months ago

At first, the batch size will be set to rocket_preload_cache_min_in_progress_jobs_count, so slow at the beginning (conservative approach).

But then, how will it be increased back to the regular size if the request time is less than 2s? In what steps?

@DahmaniAdame pinging you about the 2s loading time: if it's higher, we treat the website as heavily loaded. Also about the 5 as the minimal batch size for loaded websites.

DahmaniAdame commented 4 months ago

A starting batch of 5 is reasonable. It will help low-resource setups avoid getting overloaded after activating preload, and build up if there are enough resources to process more. From what I can see, the 2 seconds is more of a scenario than a cap, and it's likely what most of our users will get. It might be slow, but going beyond it will result in a high load. The setting can still be overruled by the batch size filter, I assume? Just to give a way to bypass it for users who still feel it's not enough and understand the resource impact of not using the automated/recommended preload pace.

MathieuLamiot commented 4 months ago

The setting can still be overruled by the batch size filter, I assume?

See this:

Bypassing the feature

Bypassing the feature means having a constant preload batch size. In the current implementation, to do this, one must set both of these filters to the same value (the desired preload batch size): rocket_preload_cache_min_in_progress_jobs_count and rocket_preload_cache_pending_jobs_cron_rows_count. Note that the estimation of the loading time will still be performed.

One would be able to force a batch size regardless of the reported loading time. We cannot currently change the 2s value if someone wants to "adapt the formula".

MathieuLamiot commented 4 months ago

To whoever picks this up, this needs: