wpsharks / comet-cache

An advanced WordPress® caching plugin inspired by simplicity.
https://cometcache.com
GNU General Public License v3.0

Suggest .htaccess rules to prevent some erroneous cache directories #101

Closed: raamdev closed this issue 7 years ago

raamdev commented 10 years ago

During my testing of the new branched cache structure on a live site, I found that after several days my cache had the following directories:

69-16-219-214/
ftp-raamdev-com/
raamdev-com/
RAAMDEV-COM/
raamdev-comhttp/

Some of these should not exist, notably the uppercase RAAMDEV-COM and raamdev-comhttp.

jaswrks commented 10 years ago

@raamdev I think this behavior is correct on the part of QC. If the host changes, a separate cache should be kept for it, since it's always possible that the host name impacts the content generated server-side. Even though a default WP install might do fine against its configured host name, you never know what else might be running on that server and/or via custom themes/plugins, which might alter the final output based on the host name in the request.
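
For instance (a contrived sketch, not from any real theme or plugin), server-side output can branch on the host name, which is why a separate cache per host is the safe default:

<?php
// Contrived example: output that varies with the requested host name.
// Anything like this makes a cache shared across host names incorrect.
if($_SERVER['HTTP_HOST'] === 'store.example.com')
    echo 'Store-specific banner';
else
    echo 'Default banner';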

You mentioned to me before that you thought a possible solution might be to offer site owners an .htaccess snippet that would help them avoid some common issues associated with different host names. The code snippet below could be generated dynamically via PHP and presented to a site owner in the Dashboard with some recommendations. I think it should be optional.

This should resolve...

# BEGIN Host Enforcer
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} !^on$ [NC]
    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
    RewriteRule .* http://example.com%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} ^on$ [NC,OR]
    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
    RewriteRule .* https://example.com%{REQUEST_URI} [R=301,L]
</IfModule>
# END Host Enforcer
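
For example, here's a minimal sketch of how that generation might look (assuming a WordPress context where home_url() is available; host_enforcer_snippet() is a hypothetical name for illustration, not an existing plugin function):

<?php
// Sketch only: build the Host Enforcer rules from the configured home URL.
function host_enforcer_snippet()
{
    $host    = parse_url(home_url(), PHP_URL_HOST); // e.g. `example.com`
    $escaped = preg_quote($host, '/'); // escape dots for the RewriteCond patterns

    return "# BEGIN Host Enforcer\n".
           "<IfModule rewrite_module>\n".
           "    RewriteEngine on\n".
           "    RewriteBase /\n\n".
           "    RewriteCond %{HTTP_HOST} !^".$escaped."\$\n".
           "    RewriteCond %{HTTPS} !^on\$ [NC]\n".
           "    RewriteCond %{HTTP:X-Forwarded-Proto} !^https\$ [NC]\n".
           "    RewriteRule .* http://".$host."%{REQUEST_URI} [R=301,L]\n\n".
           "    RewriteCond %{HTTP_HOST} !^".$escaped."\$\n".
           "    RewriteCond %{HTTPS} ^on\$ [NC,OR]\n".
           "    RewriteCond %{HTTP:X-Forwarded-Proto} ^https\$ [NC]\n".
           "    RewriteRule .* https://".$host."%{REQUEST_URI} [R=301,L]\n".
           "</IfModule>\n".
           "# END Host Enforcer\n";
}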
raamdev commented 10 years ago

You mentioned to me before that you thought a possible solution might be to offer site owners an .htaccess snippet

Thanks! I had actually forgotten that we discussed that. Yes, I agree that's the best approach.

raamdev-comhttp: This is IMO the most troublesome issue. This folder was nested inside of your raamdev-com directory, right?

No, it wasn't nested. It was at the same level as raamdev-com. Here's what that directory structure looks like right now:

wp-content/cache/http/raamdev-comhttp/raamdev-com/

Inside which I have:

2006/
2007/
2010/
2011/
raamdev-com/

And every cache file in those sub-directories is, as would be expected, a 404 (symlinked back to the default 404 file).

I tried searching my apache logs for any requests matching some of the 404s, but came up empty. I'll leave this open for now and do some more testing on my live site. I'll also defer this issue for a future release, as I don't feel it's important enough to get out right away.

raamdev commented 10 years ago

Just a quick update on this: I've been running Quick Cache Pro (from April 16th) for the past two weeks, along with the .htaccess code you recommended above for the Host Enforcer, on my raamdev.com site, and my wp-content/cache/http/ directory as of today has these subdirectories:

raamdev-com/
RAAMDEV-COM/
raamdev-comhttp/
raamdev-comHTTP/
raamdev-comhttps/

I'm installing the latest Quick Cache Pro as of today and will continue testing.

I realized that this issue with erroneous directories might also have something to do with 404 Caching: if an invalid URL is requested, Quick Cache will create the necessary subdirectories in order to make the symlink to the 404 cache file. With 404 Caching disabled (the default), I bet these erroneous directories would go away.
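
Roughly, here's what I think is happening (a sketch with assumed paths and hypothetical file names, not the plugin's actual code):

<?php
// Assumed 404 Caching behavior: every 404 gets its own cache path, which is
// symlinked back to a single shared 404 cache file. The directory layout and
// the `404.html` name below are hypothetical.
$cache_root  = 'wp-content/cache/quick-cache/cache/http';
$default_404 = $cache_root.'/raamdev-com/404.html'; // hypothetical shared 404 file
$cache_file  = $cache_root.'/raamdev-comhttp/raamdev-com/2014/some-post.html';

if(!is_dir(dirname($cache_file)))
    mkdir(dirname($cache_file), 0755, TRUE); // creates the erroneous subdirectories
if(!file_exists($cache_file))
    symlink($default_404, $cache_file); // symlink back to the default 404 file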

I'll let it run for a few days and then test again with 404 Caching disabled.

raamdev commented 10 years ago

I've had the latest dev version running for the past few days (404 Caching enabled). I have the following subdirectories in wp-content/cache/quick-cache/cache/http/

raamdev-com/
raamdev-comhttp/raamdev-com/

Inside raamdev-comhttp/raamdev-com/ I have lots of subdirectories and cache files, all of which are 404 Cache files that point back to the default 404 cache file. For example, I have the following:

raamdev-comhttp/raamdev-com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app.html (symlink)
raamdev-comhttp/raamdev-com/2014/try-it-a-different-way.html (symlink)

The actual working URLs are here:

http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/
http://raamdev.com/2013/try-it-a-different-way/

I dug through my Apache access logs looking for the 404 errors to see if there was something funky about the GET request, but here they both are and the GET requests look correct (notice both of these are from the same IP address).

122.96.59.106 - - [04/May/2014:16:19:21 -0400] "GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.0" 404 35383
122.96.59.106 - - [04/May/2014:16:21:35 -0400] "GET http://raamdev.com/2013/try-it-a-different-way/ HTTP/1.0" 404 35383

(I'm assuming these are the corresponding requests based on the fact the date and timestamps match up to when the 404 symlinks were created.)

What's odd to me is that Apache returned a 404 when the request looks like it should go through. I mean, if you copy and paste those two URLs into your browser, they won't return a 404 but rather the post they're supposed to return.

@JasWSInc Any idea what might be going on here? Or any thoughts about how else I can attempt to figure out what's going on here?


I'm going to disable 404 Caching now and let it run for a few more days just to verify that this issue goes away with 404 Caching disabled.

jaswrks commented 10 years ago

Regarding these two log entries in your Apache log...

122.96.59.106 - - [04/May/2014:16:19:21 -0400] "GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.0" 404 35383
122.96.59.106 - - [04/May/2014:16:21:35 -0400] "GET http://raamdev.com/2013/try-it-a-different-way/ HTTP/1.0" 404 35383

These actually look wrong to me, but it might just be the Apache log format you're using. Could you check on this? Ordinarily, an HTTP request includes a Host: header and of course the request itself is aimed at a particular IP address that is resolved during the request.

The GET request itself should not include a host name, only the path to a file that is expected to live on that host. So what I would expect to see in the log file is....

GET /2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/
GET /2013/try-it-a-different-way/
jaswrks commented 10 years ago

In short, when I see GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ in the log file, it looks to me like the URL that was requested was actually...

http://raamdev.comhttp://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/
raamdev commented 10 years ago

The GET request itself should not include a host name,

Ah, yes, you're absolutely right. I was looking at way too many log entries and didn't catch that. The GET request should not contain the hostname.

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

I just tried reproducing this, both in a browser and via the command line using curl, but in both cases visiting http://raamdev.comhttp://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ doesn't work... because http://raamdev.comhttp is an invalid domain.

I'm curious how such a request ever made it through to WordPress where Quick Cache picked it up.

jaswrks commented 10 years ago

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

Right. I'm not aware of a way to stop this. It's just a 404 error really.

I'm curious how such a request ever made it through to WordPress where Quick Cache picked it up.

Here's how you can reproduce it. These requests are most likely coming from a bot; it would be very difficult to reproduce this in a browser. Instead of building a URL, think about the underlying HTTP communication that would occur if you made this request without using a URL; i.e., you simply open a socket that sends an invalid GET request with the correct Host: header. That's really what a browser does anyway, except that it parses the URL you give it, and in that case the Host: header would be wrong. If you remove the URL from the equation and instead connect to the host IP and issue a GET request through a script, you can reproduce this.

<?php
error_reporting(-1);
ini_set('display_errors', TRUE);

$raamdev_ip = gethostbyname('raamdev.com'); // Resolve to an IP address.
$connection = fsockopen($raamdev_ip, 80, $errno, $errstr, 30); // Open connection.

if(!$connection) echo $errstr.' ('.$errno.')<br />'."\n";

else // We have a connection to `$raamdev_ip:80`. We're good so far.
    {
        /*
         * Build a GET request that is intentionally invalid in this case.
         */
        $request = 'GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.1'."\r\n";
        // ↑ this is intentionally invalid; it should be `/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/`.
        $request .= 'Host: raamdev.com'."\r\n"; // Apache virtual host @ `$raamdev_ip:80`.
        $request .= 'Connection: Close'."\r\n\r\n";

        /*
         * Talk to the IP handling `raamdev.com`.
         */
        fwrite($connection, $request);

        /*
         * Get the response; a 404 in this case.
         */
        while(!feof($connection))
            echo fgets($connection, 128);

        /*
         * Close the connection.
         */
        fclose($connection);
    }
jaswrks commented 10 years ago

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

One thing you could do is investigate any reports from Google Webmaster Tools for raamdev.com that may indicate you have some invalid links on your site; i.e. invalid relative locations within a document that a spider might pick up by mistake.

For example, if you have an <a href=""> tag that might confuse a spider, you could see more than your fair share of invalid requests like this. The bot is simply following what you give it, and if that's wrong you get hit with lots of 404 errors when it attempts to spider your site.

That said, this can happen even if you don't have any invalid links on the site. Some spiders just don't function properly. They get things wrong when they spider your site. You could scan your log files and try to find a bot that is consistently doing this to you; then ban it using a robots.txt file or other means.
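
For example, something like this at the Apache level could block an offender once you've identified it (a sketch; ExampleBadBot is a made-up User-Agent string, and this assumes Apache 2.4 with mod_setenvif and mod_authz_core). Keep in mind that robots.txt is only advisory and badly-behaved bots often ignore it, so a server-level block can be more reliable:

# Sketch: block a misbehaving bot by its User-Agent string.
<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent "ExampleBadBot" bad_bot
</IfModule>
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>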

raamdev commented 10 years ago

Here's how you can reproduce it.

Thanks for explaining that and for the sample code. That really helped clarify a few things for me. :) I tested that script and it does exactly as you said; it recreates the raamdev-comhttp/ directory that I was seeing along with the 404 symlink.

I'm not aware of a way to stop this. It's just a 404 error really.

Got it. We'll just offer an .htaccess file that helps clean things up a bit, if the site owner wants to implement something like that.

I think it will also be a good idea to explain that, with 404 Caching enabled, any invalid request will result in a cache file symlink being created, just so that there's no confusion about why there are cache directories for seemingly invalid hosts. In fact, I can probably turn a lot of what we've talked about here into a wiki article and reference that right from the inline docs.

raamdev commented 10 years ago

Punting this to the Future Release milestone.

raamdev commented 9 years ago

@mchlbrry writes in #288:

As a side note, for some reason I'm getting multiple domain cache directories being generated; potentially related? E.g.:

www-domain-com
www-domain-com-
www-domain-comhttp

These directories hang around too, after using the 'clear cache' option in WP.

Those are the result of a slightly misconfigured web server. Quick Cache uses the PHP $_SERVER['REQUEST_URI'] variable to determine the cache directory path it should build when generating and saving the cache file. If a request has a malformed Request URI, then Quick Cache will end up creating odd directories like that.
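
To illustrate (a rough sketch of the assumed path-building logic; cache_path() is a hypothetical helper, not Quick Cache's actual code):

<?php
// Sketch: derive a cache path from HTTP_HOST + REQUEST_URI, replacing unsafe
// characters with dashes. A malformed REQUEST_URI leaks straight into the
// resulting directory names.
function cache_path($host, $uri) // hypothetical helper, for illustration only
{
    $path = strtolower($host.$uri); // e.g. `www.domain.comhttp://www.domain.com/...`
    $path = preg_replace('/[^a-z0-9\/]+/', '-', $path); // sanitize to dashes
    return preg_replace('/[\/-]*\/[\/-]*/', '/', $path); // collapse slash/dash runs
}

echo cache_path('www.domain.com', '/some/path/')."\n";
// → www-domain-com/some/path/

echo cache_path('www.domain.com', 'http://www.domain.com/some/path/')."\n";
// → www-domain-comhttp/www-domain-com/some/path/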

Where do these requests come from? Well, a search engine bot that scans large amounts of sites could itself be misconfigured and make bad requests, which Quick Cache picks up and attempts to cache.

The best way around this issue is to create an .htaccess rule that tells the web server to always redirect any bad requests to the proper Request URI. Please see Jason's first reply at the top of this issue for an example .htaccess rule.

sallyfarmer commented 9 years ago

Ever since I installed ZenCache at http://alcohol-abuse-and-addictions-agency.co.uk, the whole site goes down whenever an .htaccess file is present in the file manager. Even with the .htaccess deleted and the site back up, none of the page or post links work. Permalinks are set to %postname% (at the bottom of the list). The .htaccess file regularly re-appears, and when it does the site goes down. I can't just delete ZenCache now, as I don't want things to get messed up further. I have an AWS account with a distribution correctly set up as per your excellent video, with a CNAME at cPanel for the CDN, etc. I used the

# BEGIN Host Enforcer
RewriteEngine on
RewriteBase /

RewriteCond %{HTTP_HOST} !^example\.com$
RewriteCond %{HTTPS} !^on$ [NC]
RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
RewriteRule .* http://example.com%{REQUEST_URI} [R=301,L]

RewriteCond %{HTTP_HOST} !^example\.com$
RewriteCond %{HTTPS} ^on$ [NC,OR]
RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
RewriteRule .* https://example.com%{REQUEST_URI} [R=301,L]
# END Host Enforcer

by replacing example.com with alcohol-abuse-and-addictions-agency.co.uk, being careful that everything else stayed exactly the same, but it didn't work, and the only way I could get the site to show again was by deleting the .htaccess file completely again. Still no links work, but the CloudFront AWS CDN is very fast at rendering the links that don't work.

raamdev commented 9 years ago

@sallyfarmer It sounds like you may have an error in your .htaccess file, or a misconfiguration on your web server. I recommend contacting your web hosting company and asking them why the .htaccess file isn't working, as they have access to the server logs and they can diagnose this issue.

jaswrks commented 8 years ago

@raamdev I'm just noting that this is another candidate for our new .htaccess tweaks system.

renzms commented 7 years ago

Tested on a site using NGINX:

[screenshot: 2017-02-03 8:05 PM]

Also tried adding the following manually for sites that use NGINX:

# Enforce exact host name.
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /

    RewriteCond %{HTTP_HOST} !^domain\.net$
    RewriteCond %{HTTPS} !^on$ [NC]
    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
    RewriteRule .* http://domain.net%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} !^domain\.net$
    RewriteCond %{HTTPS} ^on$ [NC,OR]
    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
    RewriteRule .* https://domain.net%{REQUEST_URI} [R=301,L]
</IfModule>

I was unable to continue testing, as there were problems with server detection; please see my comment here.

renzms commented 7 years ago

@raamdev

Confirmed Working

Tested Using:

WordPress Version: 4.7.2
Current WordPress Theme: Twenty Seventeen version 1.1
Theme Author: the WordPress team - https://wordpress.org/
Theme URI: https://wordpress.org/themes/twentyseventeen/
Active Plugins: Comet Cache Pro Version 170209-RC
PHP Version: 7.0.10
MySQL Version: 10.0.29-MariaDB-0ubuntu0.16.04.1
Apache Version: Apache/2.4.10 (Debian)

Tested using different incorrect web addresses and made-up subdomains, such as http://foo.bar.php70-renz.wpsharks.net/path, the IP address itself, and ftp.php70-renz.wpsharks.net.

[screenshot: 2017-02-13 5:25 PM]

No erroneous cache directories:

[screenshot: 2017-02-13 5:27 PM]

raamdev commented 7 years ago

Comet Cache v170220 has been released and includes changes from this GitHub Issue. See the v170220 announcement for further details.


This issue will now be locked to further updates. If you have something to add related to this GitHub Issue, please open a new GitHub Issue and reference this one (#101).