yous / whiteglass

Minimal, responsive Jekyll theme for hackers
https://yous.github.io/whiteglass/
MIT License

HTML-encode the ampersand in the URL #32

Closed petdance closed 5 years ago

petdance commented 5 years ago

My mistake, it needs to be HTML-encoded, not URL-encoded.

Yes, there is a problem with the unencoded ampersand. Consider:

$ cat foo.html
<a href="http://example.com/?this&that">
$ tidy -qe foo.html
line 1 column 34 - Warning: unescaped & or unknown entity "&that"

The & in the URL has to be encoded as &amp; just like any other ampersand in the HTML document.

Note that this change does not change the URL. It's simply encoding it correctly.
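To illustrate that the encoding leaves the URL itself untouched, here is a minimal Python sketch using the standard library's `html` module. It is only an illustration of the round trip, not how the theme builds its pages.

```python
import html

url = "http://example.com/?this&that"

# Escape the URL for use inside an HTML attribute value.
attr_value = html.escape(url, quote=True)
print(attr_value)  # http://example.com/?this&amp;that

# Decoding the attribute value, as an HTML parser does, gives back the same URL.
decoded = html.unescape(attr_value)
assert decoded == url
```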

yous commented 5 years ago

There is an "ambiguous ampersand" section in the HTML5 spec.

An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more alphanumeric ASCII characters, followed by a ";" (U+003B) character, where these characters do not match any of the names given in the named character references section.

But there is no named character reference matching &d;, &di;, ..., &display, etc. HTML 4.01 is similar to HTML5 in this respect: https://mathiasbynens.be/notes/ambiguous-ampersands.

I'll try tidy-html5 against this repository. As it generates some warnings, I'll consider replacing each bare & with &amp; to be explicit.

yous commented 5 years ago

This repository uses html5validator, which is based on The Nu Html Checker (v.Nu). It generates errors with the following HTML:

<link href="https://fonts.googleapis.com/css?family=Bitter:400,400i,700&copy" rel="stylesheet">

$ html5validator _site/2017/01/02/my-example-post/index.html
ERROR:html5validator.validator:"file:/Users/yous/src/whiteglass/_site/2017/01/02/my-example-post/index.html":40.1-40.72: error: The string following "&" was interpreted as a character reference. ("&" probably should have been escaped as "&amp;".)

That is because a named character reference &copy; exists. With the original URL, however, html5validator doesn't generate errors.

petdance commented 5 years ago

Character entities have to have a semicolon at the end.

If the &copy in the URL is supposed to be the copyright symbol entity, then it should have the semicolon at the end, as &copy;. And if it's not supposed to be the copyright symbol entity, then it should be &amp;copy.

yous commented 5 years ago

Yes, right. What I meant is that if there were an ambiguous ampersand in the URL, html5validator would report errors. But the actual URL doesn't contain an ambiguous ampersand, as there is no &dis;, &disp;, etc.

I tried some snippets:

  1. With <a title="&copy=foo">link</a>, html5validator gives no errors; the hover text is &copy=foo.
  2. With <a title="&copy;=foo">link</a>, html5validator gives no errors; the hover text is ©=foo.
  3. With <a title="&copy">link</a>, html5validator gives an error; the hover text is ©.
  4. With <a title="&copyfoo">link</a>, html5validator gives no errors; the hover text is &copyfoo.

So html5validator gives an error when the text after & lacks the terminating semicolon but is still parsed as a character reference.

I think using html5validator is enough. Is there anything more to consider, or am I missing something?
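The hover texts in the four snippets can be reproduced with a rough Python sketch of the rule the HTML parsing spec applies to named references inside attribute values: a reference without a trailing semicolon is left as literal text when the next character is `=` or an ASCII alphanumeric. The helper name `decode_attr` is hypothetical, numeric references are ignored, and this is a simplification rather than what html5validator or a browser actually runs.

```python
import re
from html.entities import html5  # includes "copy;" and the legacy name "copy"

# "&" followed by a run of ASCII alphanumerics and an optional ";".
_REF = re.compile(r"&([A-Za-z0-9]+;?)")

def decode_attr(value):
    """Hypothetical helper: decode named references in an attribute value."""
    def repl(m):
        run = m.group(1)
        # Try the longest prefix of the run that is a known reference name.
        for end in range(len(run), 0, -1):
            name = run[:end]
            if name not in html5:
                continue
            # Character that follows the matched name in the input.
            nxt = run[end:end + 1] or m.string[m.end():m.end() + 1]
            if not name.endswith(";") and (
                nxt == "=" or (nxt.isascii() and nxt.isalnum())
            ):
                return m.group(0)  # kept as literal text
            return html5[name] + run[end:]
        return m.group(0)  # no known name: keep the text unchanged
    return _REF.sub(repl, value)

for raw in ["&copy=foo", "&copy;=foo", "&copy", "&copyfoo"]:
    print(repr(raw), "->", repr(decode_attr(raw)))
```

Under these assumptions the four results match the hover texts listed above, and `&display=swap` comes back unchanged because no known name is a prefix of `display`.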

petdance commented 5 years ago

I understand that &display=swap is not ambiguous, and that browsers may handle it OK. Still, the correct way to do it is with &amp;display=swap. I'm not seeing a reason not to.

This instance in fonts.html is the only instance of this problem that I've seen.
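For anyone who wants to double-check, a rough way to look for other bare ampersands is a regex scan like the Python sketch below. The name `find_bare_ampersands` is made up for illustration, and the pattern is deliberately crude: it flags any `&` that does not start a semicolon-terminated reference, so it will also flag ampersands inside scripts or comments.

```python
import re

# "&" that does not begin a reference such as "&amp;", "&#38;" or "&#x26;".
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def find_bare_ampersands(text):
    """Return (line_number, line) pairs for lines containing a bare '&'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if BARE_AMP.search(line)
    ]

sample = (
    '<a href="http://example.com/?this&that">\n'
    '<a href="http://example.com/?this&amp;that">\n'
)
for number, line in find_bare_ampersands(sample):
    print(number, line)
```

Only the first line of the sample is reported, since the second one already uses &amp;.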

yous commented 5 years ago

Okay. There aren't that many ampersands, so I'm merging now.