zendtech / ZendOptimizerPlus

Feature Request: Store duplicate files once #164

Open NickStallman opened 10 years ago

NickStallman commented 10 years ago

I host hundreds of WordPress sites, and the biggest problem with opcode caches is that they store the core WordPress files hundreds of times.

I can give APC 2 GB of RAM at the moment; it takes less than a minute to fill completely, and a lot of it is the same files over and over again.

There are hacks like tricking the opcode cache with chroots (causing file path hash collisions) and using softlinks to the files, but they have security issues and cause problems when updating any of the files.

In theory, if you did a hash of the PHP file and stored the compiled opcodes under the hash, then any file with an identical hash would point to the same compiled opcodes, fixing this issue. As far as I can tell, the only security flaw would be if someone could do on-demand hash collisions, which would be pretty tricky for a modern hash algorithm, and they'd need to know the exact contents of the file in the other hosting account.
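
A rough userland sketch of the keying idea (the real change would live inside OPcache's C code; the function name here is made up for illustration):

```php
<?php
// Hypothetical illustration of the proposed scheme: key the compiled
// script by a hash of its contents rather than by its resolved path,
// so byte-identical files share one cache entry.
function opcode_cache_key(string $path): string
{
    // Any collision-resistant hash would do; SHA-256 is shown here.
    return hash_file('sha256', $path);
}

// Two WordPress installs with an identical wp-load.php produce the same
// key, so both sites would point at the same cached opcodes.
var_dump(opcode_cache_key('/home/site1/public_html/wp-load.php'));
var_dump(opcode_cache_key('/home/site2/public_html/wp-load.php'));
```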

Hashing the PHP files would incur a speed penalty, but it shouldn't need to be done very often (only when files change), and the benefit when hosting many similar files more than makes up for it. Of course, it should be an option, possibly disabled by default.

Are there any plans for this? Or any blocking downsides which I'm not thinking about?

rlerdorf commented 10 years ago

Couldn't you have a read-only shared include dir for these common files? You could even use a 2-path include_path so if a site wanted to customize a particular file and not run the common one they could just copy it to their local include directory and it would get included instead of the common one.
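
A minimal sketch of that two-path idea (all paths are invented for illustration): PHP consults include_path from left to right for relative includes, so a file copied into the per-site directory shadows the shared one.

```php
<?php
// Per-site overrides first, shared read-only WordPress install second.
set_include_path(
    '/home/site1/php-overrides' . PATH_SEPARATOR . '/usr/share/php/wordpress-shared'
);

// A relative include is resolved against include_path in order, so the
// site-local copy wins if it exists; otherwise the shared file is used.
include 'wp-includes/functions.php';
```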

Calculating a hash is super expensive, so on startup, or in a cache-full scenario where opcache dumps the entire cache, you would be in a world of hurt trying to re-hash every file. Basically that isn't feasible, so it would need some sort of two-stage cache priming/loading mechanism, which gets complicated quickly.

But to answer your direct question: no, there are no plans for this that I am aware of.

dstogov commented 10 years ago

There are no plans to implement it. I think it may be possible to achieve the result using a tricky configuration, e.g. mount (-o bind,readonly) the WordPress directory onto the same path in all chroots. Of course, it's not simple, and it's error-prone.

Thanks. Dmitry.

NickStallman commented 10 years ago

Yeah, there are nasty hacks like that. The biggest problem is when you go to upgrade the sites: it isn't really possible to upgrade that many at once, and upgrading some files but not others could do bad things.

A deduping filesystem would kind of do the right thing transparently, but there is no way to tell the opcache that a file has been deduped.

A hash is expensive, particularly at startup, but probably not as expensive as burning that much RAM. Since the cache doesn't get cleared too often (at least with APC; I hear Zend's operates differently), I know what I'd prefer.

I might get around to trying to hack something to do what I want and actually measure any performance penalty. I've only got a small amount of PHP internals knowledge though, so I doubt I could make a quality patch to contribute back.

dstogov commented 10 years ago

You are welcome to submit a "proof of concept" patch.

TerryE commented 10 years ago

@NickStallman, sorry Nick, but I don't share your views here. I regard the use of resolved filenames as a feature rather than a flaw. Any scheme for folding multiple resolved filenames onto a single OPcache entry either opens all sorts of exploitable security vulnerabilities or creates performance issues, because of the need to read in the content of each script file to create a strong hash/digest key.

You can use symlinks to do all that you require in WordPress, as I have done in the past with MediaWiki and phpBB. I described how I set up the OpenOffice user forums this way in the OOo wiki (Standardisation Strategy, Detailed Implementation Notes).

Essentially you can set up "template" WordPress versions, e.g. in /lib/wordpress/ver381, and symlink either individual files or entire directories into the relevant version. The symlinked paths are resolved, and hence all WordPress instances use the same cached compiled scripts.

Mapping individual user instances onto the correct version simply requires a symlink script, taking two input parameters (the user's docroot and the target version), to set up all the symlinks. In the case of phpBB, I'd dress-rehearse this in a dev environment, and the upgrade script did any cache directory cleanups and version-specific config file edits, then ran the batch script to update the DB schema and apply any DB tweaks. Upgrading the entire EN (English) forum DB only took ~30 secs for a 3.0.x version upgrade, so I didn't even bother with formal downtime: the script simply knocked the forum offline, did the upgrade, and brought it back online again. I never had any complaints, as people expect the odd 30-second brownout for Internet apps.
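
For illustration only, a bare-bones version of such a symlink script (the template path and the list of entries to link are assumptions, not the actual setup described above):

```php
<?php
// Usage: php relink.php <docroot> <version>
// e.g.   php relink.php /home/site1/public_html ver381
$docroot  = $argv[1];
$version  = $argv[2];
$template = "/lib/wordpress/$version";

foreach (['wp-admin', 'wp-includes', 'wp-settings.php', 'wp-load.php'] as $entry) {
    $link = "$docroot/$entry";
    if (is_link($link)) {
        unlink($link);                     // drop the old version's link
    }
    symlink("$template/$entry", $link);    // resolved path is the shared copy
}
```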


As to your underlying suggestion, I've got a slow-burn development of my MLC OPcache, which uses a shared memcached compiled-script cache connected over a local socket, and where OPcache maintains a per-rootdir mapping file of resolved filename -> SHA2 moniker. It does what you want at roughly 90% of the performance acceleration of the current OPcache, but the chances of this code ever getting into core OPcache, even once I promote it to public access, are slim to zero.

Gormartsen commented 8 years ago

@NickStallman I can recommend that you use XCache. We are in the same situation in our company, where we host thousands of Drupal websites, and XCache handles it very well.

@dstogov you can find a POC in XCache. It works very well; when we upgraded from the 1.x to the 2.x version, it saved more than 50% of the memory used for cache storage.

@TerryE when you have separate clients for each website (the web hosting case), it is almost impossible to force clients to use shared files. It makes the migration process really painful.