wp-media / wp-rocket

Performance optimization plugin for WordPress
https://wp-rocket.me
GNU General Public License v2.0
687 stars 215 forks source link

CDN RegEx picks up relative URLs when query strings have a specially-formatted value #2849

Closed vmanthos closed 2 years ago

vmanthos commented 4 years ago

When the source code includes:

  1. relative URLs, e.g. action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>
  2. containing a query string with a value formatted like this: string_1.string2

the RegEx matches that.

Here is an example from a customer's website: https://regex101.com/r/VgMfih/1

When that's being used alongside WPML or Polylang, it can result in rewriting of the URL and, if it's used in, e.g. a form, its functionality can break.

To Reproduce

Steps to reproduce the behavior:

  1. Use WPML or Polylang and set the language as a subfolder, e.g. https://example.com/fr/
  2. Add the following code on the website: action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>
  3. Enable the CDN.
  4. Check the source code. The previous URL will be rewritten to: https://cdn.example.com/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047

Expected behavior

That kind or URL shouldn't be rewritten.

Additional context

This is certainly an edge case. Excluding URLs like /fr/(.*) from CDN delivery, resolved the issue.

Related ticket: https://secure.helpscout.net/conversation/1212168255/177075/

@crystinutzaa said:

Backlog Grooming (for WP Media dev team use only)

vmanthos commented 4 years ago

A similar case where a path was incorrectly picked-up by the RegEx: https://regex101.com/r/W11blI/1

Related ticket: https://secure.helpscout.net/conversation/1224949532/180202/

piotrbak commented 4 years ago

Another case with different markup: https://regex101.com/r/VgMfih/2

Related ticket: https://secure.helpscout.net/conversation/1258416414/188277/

vmanthos commented 4 years ago

Another case: https://secure.helpscout.net/conversation/1261279926/189183?folderId=3864740

The RegEx: https://regex101.com/r/7BqIxG/1

NataliaDrause commented 4 years ago

Related ticket: https://secure.helpscout.net/conversation/1273353523/192227?folderId=3864740

Screenshot of the rewritten source code: https://jumpshare.com/v/Fv7Llh8ZkUw90vnGjjbF

vmanthos commented 3 years ago

Related ticket: https://secure.helpscout.net/conversation/1294825358/198155/

code: https://snippi.com/s/csw9sby

/6971396.js is rewritten to the CDN.

DahmaniAdame commented 3 years ago

Similar issue - https://secure.helpscout.net/conversation/1395595665/231172/

s.src = 'https://webchat.missiveapp.com/' + w.MissiveChatConfig.id + '/missive.js';

Turns to:

s.src = 'https://webchat.missiveapp.com/' + w.MissiveChatConfig.id + 'https://cdn.ext/missive.js';

vmanthos commented 3 years ago

Related ticket: https://secure.helpscout.net/conversation/1401254760/232728/

if (/filter/.test(window.location.href)) {

becomes:

if (https://234qasd.rocketcdn.me/filter/.test(window.location.href)) {

Excluding the following fixes the issue: /filter(.*)

crystinutzaa commented 3 years ago

Reproduce the problem ✅ Issue reproduced on local testing

Identify the root cause ✅ The root cause is inside this Regex which matches everything which follows the format: https://github.com/wp-media/wp-rocket/blob/c88e099454ba5b1e2dba3e6cd46936f0525856ed/inc/Engine/CDN/CDN.php#L44 Starts with either ( " ' Continues with a / Has the pattern: any-character.(DOT)any-character Ends with either ' " )

This can match a lot of code inside JS or actions: Action example: <form method='post' enctype='multipart/form-data' id='gform_2' action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'> Inside script example : <script type="text/javaript">a (function(d,s,i,r) { if (d.getElementById(i)){return;} var n=d.createElement(s),e=d.getElementsByTagName(s)[0]; n.id=i;n.src='//js.hs-analytics.net/analytics/'+(Math.ceil(new Date()/r)*r)+ '/wp-content/4019605.js'; e.parentNode.insertBefore(n, e); })(document,"script","hs-analytics",300000); </script>

Scope a solution ✅ @wp-media/php There is no way to simply modify the regex to disallow these types of data. The solution which I can identify at this point is to prevent CDN to rewrite the URL if this match is inside an action or inside a script. The solution is to modify the code in here and prevent the replacing of the url with the CDN URL in case the matched value starts with action or is inside a script tag: https://github.com/wp-media/wp-rocket/blob/0455940481777146a4677dbf59ed928223351639/inc/Engine/CDN/CDN.php#L48

For script this Regex might help to identify if the matched URL is inside a script tag: (<script\s*((.|\n)*?)\s*(?<url>'/wp-content/4019605\.js')\s*((.|\n)*?)\s*<\/script\s*)

For action this Regex might help to identify if the matched URL starts with action=: (action\s*=\s*(?<url>'/fr/\?_ga=2\.213437008\.512656707\.1593765841-1928543133\.1539673047'))

However at this point I am concerned this is sort of a bandaid and this might fail to replace some valid URLs inside JS code.

@wp-media/php do you see any other possible solutions for this?

Estimate the effort 🔴

Tabrisrp commented 3 years ago

@wp-media/productrocket Currently we see 2 approaches to fix the issue.

  1. Skip the <script> tags when doing the CDN rewrite.
  2. Not rewriting when the URL is relative.

Downside of (1) is that we might be missing URLs inside scripts, that won't be rewritten to the CDN URL. Downside of (2) is that any relative URL won't be rewrittent at all.

We looked at CDN Enabler to compare, and they have the same issue as us whenever rewritting for relative paths is enabled.

From a product perspective, is there an approach you prefer?

GeekPress commented 3 years ago

Thanks, @Tabrisrp for sharing the solutions.

We will go with (2) as it will still allow us to rewrite absolute URLs and fix the current issue.

When (1) will fix the current issue but create a regression.

piotrbak commented 3 years ago

Issues that might be related/solved by fixing this one: https://github.com/wp-media/wp-rocket/issues/3138 https://github.com/wp-media/wp-rocket/issues/2322 https://github.com/wp-media/wp-rocket/issues/3103 https://github.com/wp-media/wp-rocket/issues/3718 https://github.com/wp-media/wp-rocket/issues/3416

crystinutzaa commented 3 years ago

@wp-media/productrocket & @Tabrisrp The desired solution is to not rewrite the relative URLs, so we have 3 options on how to approach this change:

  1. In this case, should we add another option under CDN options to enable / disable this option?
  2. Or we shall enable it by default ( to exclude the CDN rewrite of relative URLs ) and add a filter which will allow to disable it?
  3. Or it should preserve the current functionality and add a filter which will disable the CDN rewrite of relative URLs ?

😄

crystinutzaa commented 3 years ago

And I can confirm that the issues: https://github.com/wp-media/wp-rocket/issues/3138 https://github.com/wp-media/wp-rocket/issues/2322 will be fixed by this change. However for the https://github.com/wp-media/wp-rocket/issues/3103 I am not sure 100% if this is caused by the relative paths.

Also, there could be another solution: treat only scripts differently. Basically, we can rewrite everything with CDN url except the content of scripts. And in scripts we can rewrite only absolute paths and exclude from rewriting relative paths.

This solution will be harder to implement and will bring much more code complexity. Also, some relative paths would still need to be converted to CDN url, which we might miss it with this change.

GeekPress commented 3 years ago

Let's keep the fix simple, do the change, and get feedback to know if we have to iterate with a filter or implement a harder solution. An option for that is a no-go ^^

Thanks, @piotrbak for reminding us of other issues!

crystinutzaa commented 3 years ago

Scope a NEW solution ✅

Modify the regex to disallow relative URLs for CDN re-writing.

Estimate the effort ✅ [S]

webtrainingwheels commented 3 years ago

https://secure.helpscout.net/conversation/1508703973/262930?folderId=377611 This is also related to the same Hubspot script as this case: https://github.com/wp-media/wp-rocket/issues/2849#issuecomment-701253120

vmanthos commented 3 years ago

Ticket: https://secure.helpscout.net/conversation/1522244512/266312?folderId=2135277

The script was the following:

<script type="text/javascript">
(function(d,s,i,r) {
if (d.getElementById(i)){return;}
var n=d.createElement(s),e=d.getElementsByTagName(s)[0];
n.id=i;n.src='//js.hs-analytics.net/analytics/'+(Math.ceil(new Date()/r)*r)+'/1867782.js';
e.parentNode.insertBefore(n, e);
})(document,"script","hs-analytics",300000);
</script>

Our RegEx picked up /1867782.js.

WordPresseur commented 2 years ago

Related ticket: https://secure.helpscout.net/conversation/1652747895/299668?folderId=4075513

On the following:

test('flex_tests',
flex_tests,
env: [
'CMOCKA_MESSAGE_OUTPUT=XML',
'CMOCKA_XML_FILE=' + meson.build_root() + '/test/%g.xml']
)

run_target('flex-tests',
command: [flex_tests]
)

Our RefEx picked up /test/%g.xml

NataliaDrause commented 2 years ago

Related: https://secure.helpscout.net/conversation/1654443869/300056?folderId=3864740

Screenshot provided by the customer: image

piotrbak commented 2 years ago

I believe our new RocketCDN plugin and also WP Rocket would benefit from as good as possible approach to this topic.

@Tabrisrp Do you see any negative sides of not rewriting relative paths inside the internal scripts?

Tabrisrp commented 2 years ago

@piotrbak There will be some missed cases when doing that, but that can be an ok trade-off to avoid the kind of issues we have otherwise.

piotrbak commented 2 years ago

@Tabrisrp That sounds like a good trade-off. We'll leave a filter for customers that would like to rewrite relative paths inside scripts. Do we have a possibility to measure how this change affects the processing time, or you think that the change will not be noticeable?

Acceptance Criteria:

  1. We will not rewrite relative paths inside inline scripts to CDN
  2. Absolute paths (not sure if there will be any cases like that) will still be rewritten
  3. We will introduce a filter that will allow to bypass this enhancement and rewrites will work as before
Tabrisrp commented 2 years ago

Scope a solution ✅

After discussion, we will start of with changes to ignore inline scripts from the HTML (with a filter to allow it), and also add a filter to completely disable rewriting of relative paths (default to enabled)

This will require changes in CDN::rewrite():

Estimate the effort ✅

Effort [M], will need to adapt the fixtures and tests also