parse_url() invalid behaviour with UTF-8 on macOS

webignition / url

Represents a URL, a library to be used in many other places. Applies semantically-lossless normalisation for comparisons.

MIT License

parse_url() invalid behaviour with UTF-8 on macOS #28

Closed abumostafa closed 6 years ago

abumostafa commented 6 years ago

Hello,

I have a problem using the library on macOS with utf-8 characters.

Here is an example

$uri = 'http://de.wikipedia.org/wiki/Nattō';
$urlParser = new \webignition\Url\Url($uri);
echo $urlParser->__toString();

The output is:

Actual: http://de.wikipedia.org/wiki/Natt_
Expected: http://de.wikipedia.org/wiki/Natt%C5%8D

Here is an example of UTF-8 safe parsing

webignition commented 6 years ago

Cool, thanks for the report!

webignition commented 6 years ago

@abumostafa What version of PHP are you using?

abumostafa commented 6 years ago

Thanks for the quick reply. I'm using PHP 7.2.10 (cli) (built: Sep 14 2018 07:07:08) ( NTS ) installed using brew

webignition commented 6 years ago

Interesting, I'm not getting quite the same results. If I supply a string of 'http://de.wikipedia.org/wiki/Nattō', I get the same string back again.

Nevertheless, RFC 3986 is quite specific about what characters can be present in a URL. I'll work on a fix to percent-encode URL paths.
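As a rough illustration of what percent-encoding a URL path could look like, here is a minimal sketch. The function name `percentEncodePath` is hypothetical and is not taken from this library; the sketch assumes the input string is UTF-8 and encodes each path segment with `rawurlencode()`, decoding first so that already-encoded segments are not encoded twice. Note that `rawurlencode()` also encodes characters such as `:` and `@` that RFC 3986 allows in a path, so a real fix may want to be more selective.

```php
<?php
// Hypothetical helper for illustration only; not part of webignition/url.
function percentEncodePath(string $path): string
{
    // Split on "/" so the separators themselves are never encoded.
    $segments = explode('/', $path);

    $encoded = array_map(function (string $segment): string {
        // Decode first so an already-encoded segment is not double-encoded.
        return rawurlencode(rawurldecode($segment));
    }, $segments);

    return implode('/', $encoded);
}

echo percentEncodePath('/wiki/Nattō'), PHP_EOL;         // /wiki/Natt%C5%8D
echo percentEncodePath('/wiki/Natt%C5%8D'), PHP_EOL;    // /wiki/Natt%C5%8D
```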

webignition commented 6 years ago

Should be all good now in release 2.2

abumostafa commented 6 years ago

Thanks a lot

abumostafa commented 6 years ago

I have tested your solution, but it did not work :( The problem is in parse_url: when you call it, it returns the wrong value.

webignition commented 6 years ago

Seems to be working as far as I can tell based on the example you initially provided:

$url = new \webignition\Url\Url('http://de.wikipedia.org/wiki/Nattō');
var_dump((string)$url);
// http://de.wikipedia.org/wiki/Natt%C5%8D

It is not a concern that the values returned by parse_url are not ultimately correct, as there is code that subsequently corrects the otherwise-incorrect path encoding.

What gives you the impression that there is still an issue? Can you give me another example of what you expect vs what you get?

abumostafa commented 6 years ago

My tests fail on macOS but pass on Linux. I tested parse_url myself, and it is the problem.

Here is an example on macOS

php -r "print_r(parse_url('http://de.wikipedia.org/wiki/Nattō'));"

// macOS

PHP 7.2.10 (cli) (built: Sep 14 2018 07:07:08) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
    with Xdebug v2.6.1, Copyright (c) 2002-2018, by Derick Rethans
    with Zend OPcache v7.2.10, Copyright (c) 1999-2018, by Zend Technologies

Array
(
    [scheme] => http
    [host] => de.wikipedia.org
    [path] => /wiki/Natt�_
)

// linux

PHP 7.2.3 (cli) (built: Mar 22 2018 22:37:54) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
    with Zend OPcache v7.2.3, Copyright (c) 1999-2018, by Zend Technologies

Array
(
    [scheme] => http
    [host] => de.wikipedia.org
    [path] => /wiki/Nattō
)

webignition commented 6 years ago

Maybe your terminal is not able to render the relevant characters? I'm not able to find any substantiated issues with PHP itself to suggest that parse_url behaves incorrectly under macOS.

Have you run the tests that come with this package?

abumostafa commented 6 years ago

Thanks a lot for your help.

I can confirm it's not the terminal.

Here is what I did:

I updated the package to 2.2
I ran the tests
I got the same error.

I debugged parse_url and checked the output. It differs between macOS and Linux. The results are exactly as shown here: https://github.com/webignition/url/issues/28#issuecomment-428866381

webignition commented 6 years ago

Looks like there is indeed a possible issue with parse_url() under OS X/macOS. I say 'possible' as whether the matter constitutes a bug is somewhat open to interpretation. I'm not going to weigh in here as to whether the matter is or isn't a bug; that's up to the PHP maintainers to determine. There's not really anything I can do about that.

From looking at the above bug report, it seems like parse_url() was never intended to be multibyte-aware and that non-ASCII characters should be percent-encoded. By definition, http://de.wikipedia.org/wiki/Nattō is not valid, as it contains a non-ASCII character that is not percent-encoded.

Implementing a change to handle unencoded Unicode characters in a URL string brings along assumptions regarding the state of encoding of the URL being parsed. I'm not happy with such assumptions, and I'm not happy implementing and then maintaining a means for handling technically invalid URL strings.

If you correctly encode the URL to be parsed you'll not have any problems.
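For anyone landing here later, one way to encode a URL before parsing it might look like the sketch below. The helper name `encodeNonAscii` is made up for this example. It assumes the input is a UTF-8 string and percent-encodes every non-ASCII byte, leaving the ASCII parts (scheme, delimiters, existing percent-escapes) untouched. It does not handle internationalised host names, which would need IDNA/punycode conversion rather than percent-encoding.

```php
<?php
// Hypothetical helper for illustration only; assumes UTF-8 input.
function encodeNonAscii(string $url): string
{
    // No "u" modifier: the pattern matches individual bytes in 0x80-0xFF,
    // so each byte of a multibyte character becomes its own %XX escape.
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $match): string {
            return rawurlencode($match[0]);
        },
        $url
    );
}

$encoded = encodeNonAscii('http://de.wikipedia.org/wiki/Nattō');
echo $encoded, PHP_EOL; // http://de.wikipedia.org/wiki/Natt%C5%8D

// parse_url() now only ever sees ASCII bytes.
$parts = parse_url($encoded);
echo $parts['path'], PHP_EOL; // /wiki/Natt%C5%8D
```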

abumostafa commented 6 years ago

I totally agree with you: the problem is that parse_url is not multibyte-aware, and it seems to be platform- and encoding-specific. When I changed the OS locale to C instead of en_US.UTF-8, everything worked fine, so I won't worry about it anymore.
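In case it helps someone else, the locale workaround can probably also be applied from within PHP instead of changing the OS setting. This is only a sketch, on the assumption that the behaviour depends on the `LC_CTYPE` category; it switches to the `C` locale just for the `parse_url()` call and then restores the previous value.

```php
<?php
// Sketch only: assumes the macOS behaviour is tied to the LC_CTYPE locale.

// Passing "0" queries the current setting without changing it.
$previous = setlocale(LC_CTYPE, '0');

// Switch to the "C" locale for the duration of the parse.
setlocale(LC_CTYPE, 'C');
$parts = parse_url('http://de.wikipedia.org/wiki/Nattō');

// Restore whatever locale was active before.
setlocale(LC_CTYPE, $previous);

// Print the raw path bytes as hex so the terminal cannot distort them.
echo bin2hex($parts['path']), PHP_EOL;
```

With the path bytes intact, the hex dump should end in `c58d`, the UTF-8 encoding of `ō`.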

Thanks for taking the time to help and to support.

Feel free to close this issue