Closed vmanthos closed 2 years ago
A similar case where a path was incorrectly picked-up by the RegEx: https://regex101.com/r/W11blI/1
Related ticket: https://secure.helpscout.net/conversation/1224949532/180202/
Another case with different markup: https://regex101.com/r/VgMfih/2
Related ticket: https://secure.helpscout.net/conversation/1258416414/188277/
Related ticket: https://secure.helpscout.net/conversation/1273353523/192227?folderId=3864740
Screenshot of the rewritten source code: https://jumpshare.com/v/Fv7Llh8ZkUw90vnGjjbF
Related ticket: https://secure.helpscout.net/conversation/1294825358/198155/
Code: https://snippi.com/s/csw9sby
/6971396.js is rewritten to the CDN.
Similar issue - https://secure.helpscout.net/conversation/1395595665/231172/
s.src = 'https://webchat.missiveapp.com/' + w.MissiveChatConfig.id + '/missive.js';
Turns to:
s.src = 'https://webchat.missiveapp.com/' + w.MissiveChatConfig.id + 'https://cdn.ext/missive.js';
Related ticket: https://secure.helpscout.net/conversation/1401254760/232728/
if (/filter/.test(window.location.href)) {
becomes:
if (https://234qasd.rocketcdn.me/filter/.test(window.location.href)) {
Excluding the following fixes the issue:
/filter(.*)
Reproduce the problem ✅ Issue reproduced on local testing
Identify the root cause ✅
The root cause is this RegEx, which matches anything that follows the format below:
https://github.com/wp-media/wp-rocket/blob/c88e099454ba5b1e2dba3e6cd46936f0525856ed/inc/Engine/CDN/CDN.php#L44
Starts with either ( " '
Continues with a /
Has the pattern: any-character.(DOT)any-character
Ends with either ' " )
This can match a lot of code inside JS or actions:
Action example: <form method='post' enctype='multipart/form-data' id='gform_2' action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>
Inside script example : <script type="text/javaript">a (function(d,s,i,r) { if (d.getElementById(i)){return;} var n=d.createElement(s),e=d.getElementsByTagName(s)[0]; n.id=i;n.src='//js.hs-analytics.net/analytics/'+(Math.ceil(new Date()/r)*r)+ '/wp-content/4019605.js'; e.parentNode.insertBefore(n, e); })(document,"script","hs-analytics",300000); </script>
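To make the four rules above concrete, here is a minimal Python sketch of a pattern with the same shape. It is an approximation written for illustration only, not the pattern from CDN.php; `LOOSE_URL`, `rewrite_loose` and the `https://cdn.example` host are hypothetical names.

```python
import re

# Simplified stand-in for the rules listed above (NOT the pattern from CDN.php):
# an opening ( " or ', a path starting with a single /, a dot somewhere in the
# path, then a closing ' " or ).
LOOSE_URL = re.compile(
    r"""(?P<open>[("']\s*)(?P<url>/(?!/)[^"')]*\.[^"')]+)(?P<close>\s*["')])"""
)


def rewrite_loose(source: str, cdn: str = "https://cdn.example") -> str:
    """Prefix every loosely matched path with a hypothetical CDN host."""
    return LOOSE_URL.sub(
        lambda m: m.group("open") + cdn + m.group("url") + m.group("close"),
        source,
    )


# Samples taken from the cases reported in this thread.
form = "action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>"
script = (
    "n.src='//js.hs-analytics.net/analytics/'"
    "+(Math.ceil(new Date()/r)*r)+'/wp-content/4019605.js';"
)
condition = "if (/filter/.test(window.location.href)) {"

for sample in (form, script, condition):
    print(rewrite_loose(sample))
```

None of the three samples is a static asset URL, yet each one satisfies the rules, which is why a form action, a concatenated path fragment and a JavaScript regex literal all end up prefixed with the CDN host.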
Scope a solution ✅
@wp-media/php There is no way to simply modify the regex to disallow these types of data.
The solution I can identify at this point is to prevent the CDN from rewriting the URL if the match is inside an action or inside a script.
The solution is to modify the code here and prevent replacing the URL with the CDN URL when the matched value starts with action or is inside a script tag:
https://github.com/wp-media/wp-rocket/blob/0455940481777146a4677dbf59ed928223351639/inc/Engine/CDN/CDN.php#L48
For script, this RegEx might help to identify whether the matched URL is inside a script tag:
(<script\s*((.|\n)*?)\s*(?<url>'/wp-content/4019605\.js')\s*((.|\n)*?)\s*<\/script\s*)
For action, this RegEx might help to identify whether the matched URL starts with action=:
(action\s*=\s*(?<url>'/fr/\?_ga=2\.213437008\.512656707\.1593765841-1928543133\.1539673047'))
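The two patterns above hardcode one URL each. As a rough sketch of how the same checks could be built for any matched value, the Python helpers below escape the quoted URL and test its context. The function names are hypothetical, and the script check adds a guard (not present in the pattern above) so the scan cannot run past a closing script tag.

```python
import re


def is_inside_script(html: str, quoted_url: str) -> bool:
    """True if quoted_url occurs after a <script> opening tag and before its close."""
    # (?:(?!</script).)*? advances one character at a time and stops at </script,
    # so a URL located after the closing tag is not reported as inside the script.
    pattern = r"<script\b(?:(?!</script).)*?" + re.escape(quoted_url)
    return re.search(pattern, html, flags=re.DOTALL | re.IGNORECASE) is not None


def is_action_value(html: str, quoted_url: str) -> bool:
    """True if quoted_url is the value of an action attribute."""
    pattern = r"\baction\s*=\s*" + re.escape(quoted_url)
    return re.search(pattern, html, flags=re.IGNORECASE) is not None


html = (
    "<img src='/logo.png'>"
    "<form action='/fr/?_ga=2.213437008'>"
    "<script>n.src='//js.hs-analytics.net/analytics/'+'/wp-content/4019605.js';</script>"
)

print(is_inside_script(html, "'/wp-content/4019605.js'"))
print(is_inside_script(html, "'/logo.png'"))
print(is_action_value(html, "'/fr/?_ga=2.213437008'"))
```

These helpers only answer whether some occurrence of the value sits in that context. If the same quoted path appears both inside and outside a script, they cannot tell the two occurrences apart.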
However, at this point I am concerned that this is something of a band-aid and that it might fail to replace some valid URLs inside JS code.
@wp-media/php do you see any other possible solutions for this?
Estimate the effort 🔴
@wp-media/productrocket Currently we see 2 approaches to fix the issue:
(1) Exclude <script> tags when doing the CDN rewrite.
(2) Stop rewriting relative URLs.
The downside of (1) is that we might miss URLs inside scripts, which then won't be rewritten to the CDN URL. The downside of (2) is that relative URLs won't be rewritten at all.
We looked at CDN Enabler to compare, and they have the same issue as us whenever rewriting of relative paths is enabled.
From a product perspective, is there an approach you prefer?
Thanks, @Tabrisrp for sharing the solutions.
We will go with (2), as it will still allow us to rewrite absolute URLs and fix the current issue,
whereas (1) would fix the current issue but create a regression.
Issues that might be related/solved by fixing this one: https://github.com/wp-media/wp-rocket/issues/3138 https://github.com/wp-media/wp-rocket/issues/2322 https://github.com/wp-media/wp-rocket/issues/3103 https://github.com/wp-media/wp-rocket/issues/3718 https://github.com/wp-media/wp-rocket/issues/3416
@wp-media/productrocket & @Tabrisrp The desired solution is to not rewrite the relative URLs, so we have 3 options on how to approach this change:
😄
And I can confirm that the issues https://github.com/wp-media/wp-rocket/issues/3138 and https://github.com/wp-media/wp-rocket/issues/2322 will be fixed by this change. However, for https://github.com/wp-media/wp-rocket/issues/3103 I am not 100% sure that it is caused by the relative paths.
Also, there could be another solution: treat only scripts differently. Basically, we can rewrite everything with CDN url except the content of scripts. And in scripts we can rewrite only absolute paths and exclude from rewriting relative paths.
This solution will be harder to implement and will bring much more code complexity. Also, some relative paths would still need to be converted to the CDN URL, and we might miss them with this change.
Let's keep the fix simple, do the change, and get feedback to know if we have to iterate with a filter or implement a harder solution. An option for that is a no-go ^^
Thanks, @piotrbak for reminding us of other issues!
Scope a NEW solution ✅
Modify the regex to disallow relative URLs for CDN re-writing.
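As an illustration of what "disallow relative URLs" could look like, the Python sketch below only matches URLs that carry the site's own host (with an explicit or protocol-relative scheme). The pattern, the function name and the `example.com` / `cdn.example.com` hosts are assumptions made for this example, not the pattern that was shipped.

```python
import re


def rewrite_absolute_only(html: str, site_host: str, cdn_host: str) -> str:
    """Swap site_host for cdn_host in quoted absolute URLs; leave relative paths alone."""
    pattern = re.compile(
        r"""(?P<open>[("']\s*)(?P<scheme>https?:)?//"""
        + re.escape(site_host)
        + r"""(?P<path>/[^"')]*\.[^"')]+)(?P<close>\s*["')])"""
    )
    return pattern.sub(
        # Keep the original scheme (or none, for protocol-relative URLs).
        lambda m: f"{m['open']}{m['scheme'] or ''}//{cdn_host}{m['path']}{m['close']}",
        html,
    )


html = (
    '<img src="https://example.com/wp-content/uploads/a.png">'
    "<form action='/fr/?_ga=2.213437008'>"
    "<script>if (/filter/.test(window.location.href)) {}</script>"
)

print(rewrite_absolute_only(html, "example.com", "cdn.example.com"))
```

With the host required, the form action and the JavaScript regex literal are left untouched. The cost is the one discussed above: a relative path to a real asset is no longer rewritten.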
Estimate the effort ✅ [S]
https://secure.helpscout.net/conversation/1508703973/262930?folderId=377611 This is also related to the same Hubspot script as this case: https://github.com/wp-media/wp-rocket/issues/2849#issuecomment-701253120
Ticket: https://secure.helpscout.net/conversation/1522244512/266312?folderId=2135277
The script was the following:
<script type="text/javascript">
(function(d,s,i,r) {
if (d.getElementById(i)){return;}
var n=d.createElement(s),e=d.getElementsByTagName(s)[0];
n.id=i;n.src='//js.hs-analytics.net/analytics/'+(Math.ceil(new Date()/r)*r)+'/1867782.js';
e.parentNode.insertBefore(n, e);
})(document,"script","hs-analytics",300000);
</script>
Our RegEx picked up /1867782.js.
Related ticket: https://secure.helpscout.net/conversation/1652747895/299668?folderId=4075513
On the following:
test('flex_tests',
flex_tests,
env: [
'CMOCKA_MESSAGE_OUTPUT=XML',
'CMOCKA_XML_FILE=' + meson.build_root() + '/test/%g.xml']
)
run_target('flex-tests',
command: [flex_tests]
)
Our RegEx picked up /test/%g.xml
Related: https://secure.helpscout.net/conversation/1654443869/300056?folderId=3864740
Screenshot provided by the customer:
I believe our new RocketCDN plugin, and also WP Rocket, would benefit from the best possible approach to this topic.
@Tabrisrp Do you see any negative sides of not rewriting relative paths inside the internal scripts?
@piotrbak There will be some missed cases when doing that, but that can be an ok trade-off to avoid the kind of issues we have otherwise.
@Tabrisrp That sounds like a good trade-off. We'll leave a filter for customers who would like to rewrite relative paths inside scripts. Is there a way to measure how this change affects the processing time, or do you think the change will not be noticeable?
Acceptance Criteria:
After discussion, we will start off with changes to ignore inline scripts from the HTML (with a filter to allow it), and also add a filter to completely disable rewriting of relative paths (default to enabled).
This will require changes in CDN::rewrite():
Ignore inline scripts depending on the new filter rocket_cdn_inline_scripts value.
Add $relative_path_match, default value empty, and corresponding to the part of the current RegEx, inside a condition checking for the new filter rocket_cdn_relative_paths value.
Replace preg_replace_callback by preg_match_all and a loop, to be able to match against the buffer but perform the replacement on the original HTML.
Effort [M], will need to adapt the fixtures and tests also.
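The flow described in these criteria can be sketched in Python as follows. This is not the CDN::rewrite() implementation: the two keyword arguments stand in for the rocket_cdn_inline_scripts and rocket_cdn_relative_paths filters, the pattern is a simplified assumption, and blanking script bodies with equal-length padding is a choice made here so that match offsets in the buffer still line up with the original HTML.

```python
import re

# Opening tag, body and closing tag of a script element (the body may be empty).
SCRIPT_BODY = re.compile(r"(<script\b[^>]*>)(.*?)(</script\s*>)", re.DOTALL | re.IGNORECASE)


def build_pattern(site_host: str, rewrite_relative_paths: bool) -> re.Pattern:
    absolute = r"(?:https?:)?//" + re.escape(site_host)
    # When relative paths are allowed, the host prefix becomes optional.
    origin = f"(?:{absolute})?" if rewrite_relative_paths else absolute
    return re.compile(
        r"""[("']\s*(?P<url>"""
        + origin
        + r"""(?P<path>/(?!/)[^"')\s<>]*\.[^"')\s<>]+))\s*["')]"""
    )


def rewrite(html, site_host, cdn_base, rewrite_inline_scripts=False, rewrite_relative_paths=True):
    buffer = html
    if not rewrite_inline_scripts:
        # Blank script bodies but keep their length, so offsets match the original.
        buffer = SCRIPT_BODY.sub(
            lambda m: m.group(1) + " " * len(m.group(2)) + m.group(3), html
        )
    pattern = build_pattern(site_host, rewrite_relative_paths)
    pieces, last = [], 0
    # Match against the buffer, but copy and replace on the original HTML.
    for m in pattern.finditer(buffer):
        start, end = m.span("url")
        pieces.append(html[last:start])
        pieces.append(cdn_base + m.group("path"))
        last = end
    pieces.append(html[last:])
    return "".join(pieces)


html = (
    '<link href="https://example.com/wp-content/style.css">'
    "<form action='/fr/?_ga=2.213437008'>"
    "<script>if (/filter/.test(window.location.href)) {}</script>"
)

print(rewrite(html, "example.com", "https://cdn.example.com"))
print(rewrite(html, "example.com", "https://cdn.example.com", rewrite_relative_paths=False))
```

With the defaults, the inline script is left alone but the relative form action is still rewritten, which is the case the second filter is meant to cover. Only the script body is blanked, so a src attribute on an external script tag remains eligible for rewriting.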
When the source code includes:
action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>
string_1.string2
the RegEx matches that.
Here is an example from a customer's website: https://regex101.com/r/VgMfih/1
When that's being used alongside WPML or Polylang, it can result in rewriting of the URL and, if it's used in, e.g. a form, its functionality can break.
To Reproduce
Steps to reproduce the behavior:
https://example.com/fr/
action='/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047'>
https://cdn.example.com/fr/?_ga=2.213437008.512656707.1593765841-1928543133.1539673047
Expected behavior
That kind of URL shouldn't be rewritten.
Additional context
This is certainly an edge case. Excluding URLs like /fr/(.*) from CDN delivery resolved the issue.
Related ticket: https://secure.helpscout.net/conversation/1212168255/177075/
@crystinutzaa said:
/ and is inside quotes.
Backlog Grooming (for WP Media dev team use only)