adding new identifier to specify loading character entities in XHTML

davidcarlisle commented 8 years ago

This issue is a continuation of the discussion started in an issue on html-build, see

https://github.com/whatwg/html-build/issues/42#issuecomment-170000019

for further context.

As currently specified in XHTML,

https://html.spec.whatwg.org/multipage/xhtml.html#parsing-xhtml-documents

to trigger definition of the character entities you have to specify the public identifier of a DTD that (as defined) is incompatible with the character definitions in the current HTML spec.

The suggestion is to allow one more identifier. In the above comment @annevk suggested web-entities rather than the long FPI syntax originally suggested.

In the upstream xml-entities spec I have updated the comments in htmlmathml-f.ent to reflect this suggestion.

I'm not wedded to that name; if a different name is preferred I'd change the distributed entity definition file to match. See the diff at

https://github.com/w3c/xml-entities/commit/59df3c0260780a5beea06d1a8f78d1ba7e22abfc#diff-91aa5c243fe976a2a456dc12cd885c5d

Adding this identifier would allow a catalog setup in which the same files can be used both in an XML workflow and in a web browser, with matching character definitions.

annevk commented 8 years ago

What do you mean by standards mode? As far as I know that is a concept that only applies to HTML (text/html).

davidcarlisle commented 8 years ago

Sorry, nonsense, I'll edit.

annevk commented 8 years ago

So, another way of viewing this is that the HTML Standard redefines what these DTD public identifiers mean and that their original definition is now moot/obsolete. That seems to be the interpretation of implementations. However, I personally don't see much of a problem adding one identifier that is less tainted by legacy.

zcorpan commented 8 years ago

This is the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409 which includes rationale for why this was rejected. What has changed that makes this a good idea now?

zcorpan commented 8 years ago

Per https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409#c3 it seems like it might be more compatible with deployed content and some old browsers to support SYSTEM "mathml.dtd" instead of a new FPI.

cc @hsivonen

davidcarlisle commented 8 years ago

On 11 January 2016 at 17:59, Simon Pieters notifications@github.com wrote:

This is the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409 which includes rationale for why this was rejected. What has changed that makes this a good idea now?

That was dropped because, following W3C procedure as it was at the time, I not only had to publish

https://www.w3.org/2003/entities/2007doc/xhtmlpubid.html

but also had to re-publish updated versions for each stage of the W3C Rec track process every few months until it was implemented, and I decided that was an unreasonable way to ask for a one-line change.

So the issue is as it always was; nothing has changed. It came up now because I mentioned that change request in the original html-build issue and Anne did half of it (using a space, not U+00A0, in the documentation) and suggested raising an issue here as a route for pushing the remaining part.

domenic commented 8 years ago

Yeah, it seems fine: if at least a couple of browsers are content with adding a new doctype to that list, then we could certainly do so in the spec. So "addition/proposal" + "needs implementer interest" seems accurate to me.

zcorpan commented 8 years ago

I downloaded Firefox 2 and tested it a bit. Here's what it seems to do to enable the entity set based on "mathml.dtd" in the system identifier (SI):

It resolves the SI and does a case-insensitive compare for "mathml.dtd" of the last path component.
If it fails to resolve the URL it doesn't enable the entities.
Also this doesn't appear to work at all for data: URL documents.

Using the SI for this has the benefit of being able to use an FPI that is compatible with current browsers, or whatever FPI may be necessary for other XML tools one is interested in supporting. And <!DOCTYPE html SYSTEM "mathml.dtd"> seems easy enough to remember.
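
The Firefox 2 behaviour observed above can be approximated in a few lines. This is only a sketch of the three observations, not browser code: the function name `firefox2_enables_entities` and its parameters are hypothetical, Python's `urljoin` stands in for the browser's URL resolution, and "fails to resolve" is modelled, as an assumption, by the resolved URL having no scheme.

```python
from urllib.parse import urljoin, urlsplit


def firefox2_enables_entities(system_id: str, document_url: str) -> bool:
    """Rough model of the observed Firefox 2 check (hypothetical helper)."""
    # Observation: this did not appear to work at all for data: URL documents.
    if urlsplit(document_url).scheme == "data":
        return False

    # Observation: the system identifier is resolved (here: against the
    # document URL) before it is compared.
    resolved = urlsplit(urljoin(document_url, system_id))

    # Assumption: treat a result without a scheme as "failed to resolve".
    if not resolved.scheme:
        return False

    # Observation: the last path component is compared case-insensitively.
    last_component = resolved.path.rsplit("/", 1)[-1]
    return last_component.lower() == "mathml.dtd"
```

Under this model, both `mathml.dtd` and `DTD/MathML.DTD` in a document served over http would enable the entities, while `mathml2.dtd` would not.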

zcorpan commented 8 years ago

SELECT COUNT(*) AS num, REGEXP_EXTRACT(body, r'(<\!DOCTYPE\s+[a-zA-Z0-9:-]+\s+(?:PUBLIC\s+["\'][^"\'>]+["\']|SYSTEM)\s+["\'](?:[^"\'>]*\/)?[Mm][Aa][Tt][Hh][Mm][Ll]\.[Dd][Tt][Dd]["\']\s*>)') AS doctype
FROM [httparchive:runs.2014_08_15_requests_body] 
WHERE mimeType CONTAINS "xml"
GROUP BY doctype
ORDER BY num DESC;

No matches in httparchive.
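
To see what the pattern in that query does and does not pick up, it can be tried locally. The following is a sketch that ports the same expression to Python's `re` module. The helper name `extract_mathml_doctype` and the sample doctypes in the usage note are made up for illustration, and BigQuery's regex dialect may differ from Python's in edge cases.

```python
import re
from typing import Optional

# Same pattern as in the query above, split over three lines for readability.
DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+[a-zA-Z0-9:-]+\s+"""
    r"""(?:PUBLIC\s+["'][^"'>]+["']|SYSTEM)\s+"""
    r"""["'](?:[^"'>]*/)?[Mm][Aa][Tt][Hh][Mm][Ll]\.[Dd][Tt][Dd]["']\s*>"""
)


def extract_mathml_doctype(body: str) -> Optional[str]:
    """Return the first doctype whose SI ends in mathml.dtd, or None."""
    match = DOCTYPE_RE.search(body)
    return match.group(0) if match else None
```

For example, `<!DOCTYPE html SYSTEM "mathml.dtd">` is extracted, whereas `<!DOCTYPE html>` and a system identifier of `mathml2.dtd` are not.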

https://github.com/search?q=doctype+"mathml+dtd"+&ref=searchresults&type=Code&utf8=✓

11,451 code results, of which 331 are .xml files (though some of the others are PHP etc. that might be used to serve XML). From a quick look they appear to be using all-lowercase "mathml.dtd" but sometimes point to mathml.dtd on w3.org.

Concrete proposal, given the above:

If the system identifier is the string "mathml.dtd" or ends with the string "/mathml.dtd" (both case-sensitive), then that corresponds to the spec's "the URL given by this link" DTD.
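
As a minimal sketch, the proposed rule amounts to a plain string test on the system identifier, with no URL resolution and no case folding. The function name `matches_proposed_rule` is hypothetical and not spec text.

```python
def matches_proposed_rule(system_id: str) -> bool:
    """Check the proposed, case-sensitive system identifier rule."""
    # Either the identifier is exactly "mathml.dtd", or it ends with
    # "/mathml.dtd" (for example, an absolute URL pointing at w3.org).
    return system_id == "mathml.dtd" or system_id.endswith("/mathml.dtd")
```

Unlike the Firefox 2 behaviour described earlier, `MathML.DTD` would not match under this rule, and neither would an identifier such as `xmathml.dtd` that merely ends in the same characters without a preceding slash.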

zcorpan commented 8 years ago

cc @fred-wang

whatwg / html

adding new identifier to specify loading character entities in XHTML #500