Consider a no-eval mode (e.g. store token array)

Krinkle commented 4 years ago

To reduce operational risk for end-users and simplify security auditing, I think it would be interesting to have an alternate mode in LightnCandy where compile() returns an optimised array of tokens instead of a string of executable PHP code.

E.g. instead of

(string)"return 'Hello! '.LR::encq($cx, (($inary && isset($in['name'])) ? $in['name'] : null)).' is '.LR::encq($cx, (($inary && isset($in['gender'])) ? $in['gender'] : null)).'.;"

It would return something like:

(array)['Hello! ', [TOKEN_ENCQ, 'name'], ' is ', [TOKEN_ENCQ, 'gender'], '.']

This could then be recognised by render() and it would then basically do the same as what the executable code does. But, with one extra step, which would be a (cheap) translation from the token to the same function calls that would otherwise appear in the executable string, and concatenate the pieces.

I am aware this would sacrifice performance on the benchmark, so I would not propose this as replacement or even default, but merely as an option for users for whom the (hopefully small) difference is acceptable.

Would this be of interest? I am willing to help with the implementation :)

Krinkle commented 4 years ago

@zordius I'd love a response on the above. I can try to schedule it for early next year to work – before upgrading LightNCandy for MediaWiki.

zordius commented 4 years ago

Sorry for replying late.

The core concept of lightncandy is to compile the handlebars code into pure PHP code for performance, so I did not design the token output for the compile() method.

But, for internally job, there is a median layer of tokens inside the lightncandy. I do not know how people can use it for what, but it still can be extracted. Here is an example code:

require('./vendor/autoload.php');
use LightnCandy\LightnCandy;

// The Template:
$template = <<<VAREND
Hello, {{world}}
{{#if foo}}
  Yes, it's good.
{{/if}}
VAREND;

$phpStr = LightnCandy::compile($template, array(
  // Used compile flags
  'flags' => LightnCandy::FLAG_HANDLEBARS | LightnCandy::FLAG_ERROR_EXCEPTION,
));

echo "Generated PHP Code:\n$phpStr\n";

echo "Dump of token:\n";
print_r(LightnCandy::$lastParsed);

The token data is complex and not documented, I think it may not be helpful for you.

zordius commented 4 years ago

I just realize the 'no-eval' topic. Basically, I prefer people to use LightnCandy as pre-compile all template then execute the generated code later pattern. How ever, to include a PHP file generated by lightncandy is a kind of eval() or not, it's an interesting question.

jenschude commented 4 years ago

Every template engine with logic support is some kind of eval. But typically you should have static templates. When someone tries to support dynamic templates given as input by users than you can‘t help them anyway

Krinkle commented 4 years ago

@zordius I think it is okay if it is not stable across versions. I believe currently, we also need to regenerate compiled code after upgrades?

If I work on making a performant way to execute the token array, and maybe changing the internal token format to make this possible (if needed), would you be open toward such a feature being part of lightncandy? e.g. ::compileTokens and ::executeTokens, or something like that.

@jenschude There is no PHP executable code in Mustache templates. All executable code is within the lightncandy library itself. The only difference is whether we serialise if ($ctx->foo) echo $this->escape($ctx->foo) as a string of PHP code, or as a token that says [COND, PROP, "foo"], [ESCAPE, PROP, "foo"]. The former means it is near-impossible to audit or safeguard. If the code is changed to execute something else anywhere in PHP, it could. The latter is limited to a subset of methods that lightncandy understands the tokens for. Which means if I audit the lightncandy code and accept the methods it calls, I can accept all execution of templates also.

There may be a small penalty in performance, but I believe this can be minimised greatly. I also believe that within the server-side, performance is not as important for a template library. (I say this whilst being a member of the Performance Team for Wikipedia's software.)

I would expect that the number of template calls for a typical web request to be far less than a million. Even a large complex code base with millions of templates would not execute all of them for one user at once, or at the very least be able to cache the output of some of them between users. As such, 1% or 5% slowdown is unlikely to have an impact of even one millisecond. I might prove myself wrong, but I remain optimistic until then :)

jenschude commented 4 years ago

It‘s counter intuitive. PHP has been created as a template language. So any template engine on top is a little bit ridiculous. But still it makes sense if you compile it to pure PHP.

Twig does it. Smarty does it and the ones not compiling it to a cached version are not relevant any more cause they are way to slow.

And what i meant is that typically your templates live as a static file in your code and under your control like your PHP scripts themself. And best it would be even compiled and cached before deployment.

So there should be no need to render using tokens.

jenschude commented 4 years ago

What I forgot to add. Handlebars is a reduced logic template language. Branching is either done using data or helpers. So again under your control.

zordius commented 4 years ago

Ya, personally I do not like the realtime token executer pattern. The difference between 'token mode' or 'compiled code mode' is just performance, which all these 'EXECUTED CODE' is limited from lightncandy library. for example:

token mode: I see {{FOO}} token , then run LC::v('FOO')
compiled code: run compiled LC::v('FOO') , the code was generated by lightncandy.

Unless any study found a vulnerability inside lightncandy that it can generate unexpected code from handlebars design, or you can think it should be safe to run the sandboxed code, no matter by token mode or by compiled code.

Krinkle commented 4 years ago

Below is a simplified summary of our current situation. This is very simplified, so do ask more question if something sounds strange.

Preference to not commit compiled code files.
Policy of not allowing web server to create PHP files (program should not be trusted or allowed to create new executables during run-time).
Preference to have easy deployment and automatic propagation as-needed. (We use cache-warmup scripts).

The problem is not whether there is a security problem in lightncandy today. Instead, I believe it is a good security policy to not allow execution of unknown PHP code or execution of unreviewed PHP files.

This has led to a strange compromise for our initial trial deployment of lightncandy. We decided to store PHP code inside Memcached (or Redis), and then give it to eval. (With reduced risk through secret keys and HMAC validation).

My proposal is to stop this. Instead, we can cache a token array (e.g. simple enough to be JSON-compatible) after the compilation/optimiser step. There is no executable code here. Then, instead of serialising to PHP code, store it directly. Then, feed that at run-time. This means all executable code is statically reviewable, and then can be deployed. This provides much easier to understand behaviour and security guarantees. Performance is important. But, I believe that at normal production use, the difference between these models will be insignificant. At least as an option?

I can implement it in a fork, but would prefer to collaborate and share with other users.

zordius commented 4 years ago

I got the context, and I agree the token mode is your only solution. Based on my previous comment, the token runner can do tasks by parsed tokens inside LightnCandy::$lastParsed , it should be a hard work to create a runner. It can be a new API, I think the new API will not harm to current performance mode.

zordius commented 4 years ago

And, just a reference, lightncandy was a competitor of handlebars.php, which is a token mode handlebars solution, and we get 10~ 50 times faster than it. My thinking is: the token mode lightncandy should be slower than compiled mode much. Basically it is caused by the wrapped code of real execution:

foreach ($tokens as $token) {
  switch (type($token)) {
    case xxx
       real_execute_xxx(...)
     ...
  }
}

jenschude commented 4 years ago

I think the main impact in terms of performance of the compiled code comes from the OpCache. The compiled code will be hold in the shared memory of the OpCache. Whereas an Token-Structure or AST would have to be rebuild in memory on every request.

It could be optimized by writing it out as a static array. But this is limited to arrays with a maximum of 65K entries (when i remember it right). See https://blog.blackfire.io/speeding-up-autoloading-on-php-5-6-7-0-for-everyone.html

zordius / lightncandy

Consider a no-eval mode (e.g. store token array) #322