whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

adding new identifier to specify loading character entities in XHTML #500

Open davidcarlisle opened 8 years ago

davidcarlisle commented 8 years ago

This issue is a continuation of the discussion started in an issue on html-build, see

https://github.com/whatwg/html-build/issues/42#issuecomment-170000019

for further context.

As currently specified in XHTML,

https://html.spec.whatwg.org/multipage/xhtml.html#parsing-xhtml-documents

to trigger definition of the character entities you have to specify the public identifier of a DTD that (as defined) is incompatible with the character definitions in the current HTML spec.

The suggestion is to allow one more identifier. In the comment linked above, @annevk suggested web-entities rather than the long FPI syntax originally suggested.

In the upstream xml-entities spec I have updated the comments in htmlmathml-f.ent to reflect this suggestion.

I'm not wedded to that name; if a different name is preferred, I'd change the distributed entity definition file to match. See the diff at

https://github.com/w3c/xml-entities/commit/59df3c0260780a5beea06d1a8f78d1ba7e22abfc#diff-91aa5c243fe976a2a456dc12cd885c5d

Adding this identifier would allow a catalog setup that lets the same files be used in an XML workflow and in a web browser with matching character definitions.
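
For readers less familiar with the mechanism, here is a rough sketch of the check being discussed: the parser compares the doctype's public identifier against a fixed list and loads the predefined entity declarations on a match. This is an illustration only, not the spec's algorithm or any browser's code. The list below is a small illustrative subset, the function name is hypothetical, exact string matching is assumed, and `web-entities` is used only because it is the name floated in this thread.

```python
# Hypothetical sketch: should a doctype's public identifier trigger the
# predefined character entity definitions?

# Illustrative subset of legacy public identifiers (assumption: the full,
# authoritative list is the one in the HTML Standard's XHTML parsing section).
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
}

# The additional identifier suggested in this thread (name not final).
PROPOSED_PUBLIC_ID = "web-entities"


def triggers_entities(public_id, allow_proposed=False):
    """Return True if the entity definitions would be loaded for public_id."""
    if public_id is None:
        # No PUBLIC identifier in the doctype: nothing to match against.
        return False
    if public_id in KNOWN_PUBLIC_IDS:
        return True
    # The proposal amounts to accepting one more exact string.
    return allow_proposed and public_id == PROPOSED_PUBLIC_ID
```

With `allow_proposed=False` this models the current situation, where only the legacy identifiers work; with `allow_proposed=True` it models the one-line addition being requested.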

annevk commented 8 years ago

What do you mean by standards mode? As far as I know that is a concept that only applies to HTML (text/html).

davidcarlisle commented 8 years ago

Sorry, nonsense, I'll edit.

annevk commented 8 years ago

So, another way of viewing this is that the HTML Standard redefines what these DTD public identifiers mean and that their original definition is now moot/obsolete. That seems to be the interpretation of implementations. However, I personally don't see much of a problem adding one identifier that is less tainted by legacy.

zcorpan commented 8 years ago

This is the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409 which includes rationale for why this was rejected. What has changed that makes this a good idea now?

zcorpan commented 8 years ago

Per https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409#c3 it seems like it might be more compatible with deployed content and some old browsers to support SYSTEM "mathml.dtd" instead of a new FPI.

cc @hsivonen

davidcarlisle commented 8 years ago

On 11 January 2016 at 17:59, Simon Pieters notifications@github.com wrote:

This is the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409 which includes rationale for why this was rejected. What has changed that makes this a good idea now?

That was dropped because, following the W3C procedure as it was at the time, I not only had to publish

https://www.w3.org/2003/entities/2007doc/xhtmlpubid.html

I also had to re-publish updated versions for each stage of the W3C Rec track process every few months until it was implemented, and I decided that was an unreasonable way to ask for a one-line change.

So the issue is as it always was; nothing has changed. It came up now because I mentioned that change request in the original html-build issue, and Anne did half of it (using a space rather than U+00A0 in the documentation) and suggested raising an issue here as a route for pushing the remaining part.

domenic commented 8 years ago

Yeah it seems fine that if at least a couple browsers are content with adding a new doctype to that list, then we could certainly do so in the spec. So "addition/proposal" + "needs implementer interest" seems accurate to me.

zcorpan commented 8 years ago

I downloaded Firefox 2 and tested it a bit. Here's what it seems to do to enable the entity set based on "mathml.dtd" in the SI:

Using the SI for this has the benefit of being able to use an FPI that is compatible with current browsers, or whatever FPI may be necessary for other XML tools one is interested in supporting. And <!DOCTYPE html SYSTEM "mathml.dtd"> seems easy enough to remember.

zcorpan commented 8 years ago

SELECT COUNT(*) AS num, REGEXP_EXTRACT(body, r'(<\!DOCTYPE\s+[a-zA-Z0-9:-]+\s+(?:PUBLIC\s+["\'][^"\'>]+["\']|SYSTEM)\s+["\'](?:[^"\'>]*\/)?[Mm][Aa][Tt][Hh][Mm][Ll]\.[Dd][Tt][Dd]["\']\s*>)') AS doctype
FROM [httparchive:runs.2014_08_15_requests_body] 
WHERE mimeType CONTAINS "xml"
GROUP BY doctype
ORDER BY num DESC;

No matches in httparchive.
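
To make it easier to see what the query's pattern does and does not catch, here is a minimal Python sketch that applies an equivalent regular expression to a few doctype strings. The pattern is a port of the one in the query with the unnecessary `\!` and `\/` escapes dropped. The function name and the sample doctypes are made up for illustration and are not taken from httparchive data.

```python
import re

# Port of the query's pattern: a doctype whose system identifier is
# "mathml.dtd" (any letter case), optionally preceded by a path or URL prefix.
MATHML_DTD_DOCTYPE = re.compile(
    r"""<!DOCTYPE\s+[a-zA-Z0-9:-]+\s+(?:PUBLIC\s+["'][^"'>]+["']|SYSTEM)\s+["'](?:[^"'>]*/)?[Mm][Aa][Tt][Hh][Mm][Ll]\.[Dd][Tt][Dd]["']\s*>"""
)


def extract_mathml_doctype(body):
    """Return the first matching doctype in body, or None if there is none."""
    match = MATHML_DTD_DOCTYPE.search(body)
    return match.group(0) if match else None


# Hypothetical sample inputs, chosen to exercise each branch of the pattern.
samples = [
    '<!DOCTYPE html SYSTEM "mathml.dtd">',
    '<!DOCTYPE math SYSTEM "http://www.w3.org/Math/DTD/mathml1/mathml.dtd">',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "MathML.DTD">',
    '<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" '
    '"http://www.w3.org/Math/DTD/mathml2/mathml2.dtd">',
    "<!DOCTYPE html>",
]

for sample in samples:
    print(extract_mathml_doctype(sample) is not None, sample)
```

Note that the pattern only matches a file name of exactly mathml.dtd, so a system identifier ending in mathml2.dtd is not counted.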

https://github.com/search?q=doctype+"mathml+dtd"+&ref=searchresults&type=Code&utf8=✓

11,451 code results, of which 331 are .xml files (though some of the others are PHP etc. that might be used to serve XML). From a quick look, they appear to use all-lowercase "mathml.dtd", but some point to mathml.dtd on w3.org.

Concrete proposal, given the above:

zcorpan commented 8 years ago

cc @fred-wang