add <p:validate-with-dtd>

Markup-Fanatic commented 1 year ago

As an XProc developer, I want native and performant DTD support for validation, so that I don't need to build a workaround myself that temporarily stores, loads (and validates), and then deletes my result.

Acceptance criteria

<p:validate-with-dtd> is provided

ndw commented 1 year ago

Right. Okay. Achim and I chatted about this the other day.

DTD validation is by definition something that has to be performed on a sequence of characters. You can't "DTD validate" an XML document that's already been parsed. That's incoherent.

Here's a back-of-the-envelope design:

<p:declare-step type="p:validate-with-dtd">
     <p:input port="source" primary="true" content-types="xml html text"/>
     <p:output port="result" primary="true" content-types="xml"/>
     <p:output port="report" sequence="true" content-types="xml json"/>
     <p:option name="doctype-public" select="()" as="xs:string?"/>
     <p:option name="doctype-system" select="()" as="xs:string?"/>
     <p:option name="assert-valid" select="true()" as="xs:boolean"/>
     <p:option name="entities" as="map(xs:string,item())?"/>
     <p:option name="report-format" select="'xvrl'" as="xs:string"/>
     <p:option name="serialization" as="map(xs:QName,item()*)?"/>
</p:declare-step>
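
To make the shape of the proposal concrete, here is a rough sketch of how a pipeline might call the step. It assumes the signature above is adopted unchanged; p:validate-with-dtd does not exist in XProc 3.0 today, so this is illustrative only. The DocBook 4.5 identifiers are used purely as a familiar example.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Hypothetical step: only proposed in this thread, not part of XProc 3.0. -->
  <!-- assert-valid defaults to true in the sketch above, so an invalid
       document would raise an error here. -->
  <p:validate-with-dtd
      doctype-public="-//OASIS//DTD DocBook XML V4.5//EN"
      doctype-system="http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"/>
</p:declare-step>
```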

If entities is non-empty, each entity must be either a string or a map containing a doctype-system key and an optional doctype-public key.
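
As an illustration, an entities value might look like the following. This is a sketch under assumptions the design above does not spell out: that the map keys are entity names, that a string value is the replacement text of an internal entity, and that a map value declares an external entity. The entity names and example.com URIs are made up.

```xml
<!-- Hypothetical step and values; the key/value semantics are assumed. -->
<p:validate-with-dtd doctype-system="http://example.com/dtd/article.dtd">
  <p:with-option name="entities" select="map {
      'product': 'Example Product',
      'logo': map {
        'doctype-system': 'http://example.com/entities/logo.ent',
        'doctype-public': '-//Example//ENTITIES Logo//EN'
      }
    }"/>
</p:validate-with-dtd>
```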

If the source is XML or HTML, it's serialized with a <!DOCTYPE declaration and declarations for the declared entities in the internal subset. If the source is text, it's the caller's responsibility to format the text with a <!DOCTYPE declaration and other features.
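
For example, the serialized form handed to the parser might look roughly like this. It is a sketch only: the identifiers, entity names, and the article vocabulary are invented, and it assumes the name in the <!DOCTYPE declaration is taken from the document element.

```xml
<!DOCTYPE article PUBLIC "-//Example//DTD Article//EN"
                         "http://example.com/dtd/article.dtd" [
  <!-- Internal entity, assumed to come from a string value in entities. -->
  <!ENTITY product "Example Product">
  <!-- External entity, assumed to come from a map value in entities. -->
  <!ENTITY logo PUBLIC "-//Example//ENTITIES Logo//EN"
                       "http://example.com/entities/logo.ent">
]>
<article>
  <title>Release notes</title>
</article>
```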

The serialized document is parsed with an XML parser. The implementation may serialize the document "in memory" or on disk at its discretion. If assert-valid is true, a validating parser must be used. If assert-valid is false, a non-validating parser must be used.

It is an error to provide doctype-public without doctype-system. It is an error to specify that assert-valid is true without providing a doctype-system value.

I'm not sure what to do about the base URI if the input doesn't have one. I guess you just get errors if any of the system ID references are not absolute.

This is probably incomplete, but it's a start.

xml-project commented 6 months ago

If assert-valid is true, a validating parser must be used. If assert-valid is false, a non-validating parser must be used.

I do not think this is consistent with the other validation steps. To my understanding, "assert-valid" = false with the other steps means: validate the document, but do not raise an error. So it is the author's duty to check the "report" port. For p:validate-with-dtd, setting "assert-valid" to false would mean: do not validate.

If I got this right: I would prefer to have "assert-valid" aligned with the other validation steps.
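
For comparison, this is roughly how the existing behavior looks with p:validate-with-relax-ng in XProc 3.0: validation still runs with assert-valid set to false, and the outcome is read from the report port instead of surfacing as an error. The schema.rng file name is a placeholder.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <!-- Return the validation report rather than the validated document. -->
  <p:output port="result" pipe="report@validate" sequence="true"/>

  <!-- The document is validated, but invalidity does not raise an error. -->
  <p:validate-with-relax-ng name="validate" assert-valid="false">
    <p:with-input port="schema" href="schema.rng"/>
  </p:validate-with-relax-ng>
</p:declare-step>
```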