wol-soft / php-json-schema-model-generator

Creates (immutable) PHP model classes from JSON-Schema files including all validation rules as PHP code
MIT License
59 stars 12 forks source link

Incorrectly escaped backslashes in regex in generated code #40

Closed dktapps closed 2 years ago

dktapps commented 2 years ago

Describe the bug

    },
    "main": {
      "description": "The fully-qualified name of the main class that extends PluginBase",
      "pattern": "([A-Za-z_]\\w+\\\\)*([A-Za-z_]\\w+)",
      "type": "string"
    },

This schema causes the following code to be generated:

                    if (is_string($value) && !preg_match('/([A-Za-z_]\w+\\)*([A-Za-z_]\w+)/', $value)) {
                        $this->_errorRegistry->addError(new \PHPModelGenerator\Exception\String\PatternException($value ?? null, ...array (
  0 => 'main',
  1 => '([A-Za-z_]\\w+\\\\)*([A-Za-z_]\\w+)',
)));
                    }

                return $value;
            }

This pattern is meant to allow backslashes, hence the 4 \\\\ (to match the literal backslash character).

You can see that the backslash given to preg_match() is incorrectly escaped: there are only 2 backslashes in the resulting string, which is interpreted by PHP as a single backslash even in a single-quoted string.

Expected behavior The backslash pattern above should be correctly escaped.

Schema plugin-schema.json.txt

Version: 0.21.0

dktapps commented 2 years ago

Digging, seems like this is quite an inconvenient problem to solve. My solution would be to embed the patterns as base64 blobs instead of as bare strings, and then generate code like preg_match(base64_decode('AAAAAAA===')) instead, so that PHP string escaping doesn't get a chance to screw anything up.

wol-soft commented 2 years ago

Hi @dktapps

Thanks for the report! I'll have a look into it later.

wol-soft commented 2 years ago

After some testig I've adopted your idea of using encoded strings rather than the regular expression itself in the generated code. I've also added a test case including a regular expression with backslashes´and forward slashes to test escaping and the en-/decoding.

dktapps commented 2 years ago

Additional escaping shouldn't be needed at all if you base64, no? It's only needed in the first place to prevent PHP interpreting it differently as a const string.

wol-soft commented 2 years ago

Yeah I thought the same way first but it's wrong (compare https://3v4l.org/YLRvS which matches a single backslash).

Also forward slashes must be escaped as they are used as delimiter of the regular expression.

dktapps commented 2 years ago

In the sample I gave, the 4 backslashes are intended to match 1 backslash (JSON \\\\ resolves to \\ when decoded, passed to PCRE, which then turns it into literal \). I'm just wondering if that might break something.

Also forward slashes must be escaped as they are used as delimiter of the regular expression.

preg_quote() might be better for that (specifically intended for purpose), though I don't suppose it matters much.

dktapps commented 2 years ago

I just tested dev-master with my schema + data and it works.. I'll never understand this godawful mess of backslashes. Thank you for the fix.

wol-soft commented 2 years ago

Yeah escaping is always hard, in this case on three layers. The JSON Schema, the PHP code and the PCRE engine. Glad it works now as expected.

wol-soft commented 2 years ago

I've thought about it again and you are absolutely right.

Four backslashes in the JSON mean two backslashes in the regular expression due to JSON escaping. Two backslashes in the regular expression must match one backslash in the tested string due to PCRE escaping. In the current implementation four backslashes in the JSON match two backslashes in the tested string which is wrong. It's so obscure.

But the delimiter escaping must be kept 😄

Edit: changed in https://github.com/wol-soft/php-json-schema-model-generator/commit/7dbf81a3c59ee87708168633e25760d8d912972b