zendframework / zend-filter

Filter component from Zend Framework
BSD 3-Clause "New" or "Revised" License
68 stars 36 forks source link

CamelCaseToSeparator unicode regexp not equivalent to classic. #87

Closed donaldinou closed 5 years ago

donaldinou commented 5 years ago

On the CamelCaseToSeparator class, the regexp is not the same if the unicode support is on or not (Actually the unicode is wrong). The result is that it will produce unexpected result if we're using unicode support.

// Current
// ...
if (StringUtils::hasPcreUnicodeSupport()) {
  $pattern     = ['#(?<=(?:\p{Lu}))(\p{Lu}\p{Ll})#', '#(?<=(?:\p{Ll}|\p{Nd}))(\p{Lu})#'];
  $replacement = [$this->separator . '\1', $this->separator . '\1'];
} else {
  $pattern     = ['#(?<=(?:[A-Z]))([A-Z]+)([A-Z][a-z])#', '#(?<=(?:[a-z0-9]))([A-Z])#'];
  $replacement = ['\1' . $this->separator . '\2', $this->separator . '\1'];
}
// ...
}

As you can see an uppercase bracket is missing.([A-Z+])

$pattern     = ['#(?<=(?:\p{Lu}))(\p{Lu}\p{Ll})#', '#(?<=(?:\p{Ll}|\p{Nd}))(\p{Lu})#'];

so it, should be

$pattern     = ['#(?<=(?:\p{Lu}))(\p{Lu}+)(\p{Lu}\p{Ll})#', '#(?<=(?:\p{Ll}|\p{Nd}))(\p{Lu})#'];
michalbundyra commented 5 years ago

@donaldinou Thanks for the report. Would you like provide PR with the hotfix and unit tests?

michalbundyra commented 5 years ago

@donaldinou I've investigated it, and I think it is invalid. For example:

TheseAre_SOME_CamelCASEDWords

Please note that $replacement array is different for both cases, and from my quick tests, changing pattern and replacement for non-unicode as shown below, works the same:

        if (StringUtils::hasPcreUnicodeSupport()) {
            $pattern     = ['#(?<=(?:\p{Lu}))(\p{Lu}\p{Ll})#', '#(?<=(?:\p{Ll}|\p{Nd}))(\p{Lu})#'];
            $replacement = [$this->separator . '\1', $this->separator . '\1'];
        } else {
            $pattern     = ['#(?<=(?:[A-Z]))([A-Z][a-z])#', '#(?<=(?:[a-z0-9]))([A-Z])#']; // removed one part
            $replacement = [$this->separator . '\1', $this->separator . '\1']; // the same as in the if statement now
        }

If you are experiencing an issue, please provide a failing test case, and then we will try to resolve it somehow. Thanks!

donaldinou commented 5 years ago

But it just that the regexp is not the same between unicode or not. The result should be the same if there is unicode activated or not. 3 months ago it was not.

Test TheseAre_SOME_CamelCASEDWords

For both unicode and not. Differents result (3 months ago no one works like you want)

michalbundyra commented 5 years ago

@donaldinou

3 months ago it was not.

There was no any change in that file for last 3 year...

If you can provide failing test case, please open another issue. Thanks!