Closed Markup-Fanatic closed 2 months ago
Right. Okay. Achim and I chatted about this the other day.
DTD validation is by definition something that has to be performed on a sequence of characters. You can't "DTD validate" an XML document that's already been parsed. That's incoherent.
Here's a back-of-the-envelope design:
<p:declare-step type="p:validate-with-dtd">
<p:input port="source" primary="true" content-types="xml html text"/>
<p:output port="result" primary="true" content-types="xml"/>
<p:output port="report" sequence="true" content-types="xml json"/>
<p:option name="doctype-public" select="()" as="xs:string?"/>
<p:option name="doctype-system" as="()" as="xs:string?"/>
<p:option name="assert-valid" select="true()" as="xs:boolean"/>
<p:option name="entities" as="map(xs:string,item())?"/>
<p:option name="report-format" select="'xvrl'" as="xs:string"/>
<p:option name="serialization" as="map(xs:QName,item()*)?"/>
</p:declare-step>
If entities
is non-empty, each entity must be either a string or a map containing a doctype-system
key and an optional doctype-public
key.
If the source
is XML or HTML, it's serialized with a <!DOCTYPE
declaration and declarations for the declared entities
in the internal subset. If the source
is text, it's the callers responsibility to format the text with a <!DOCTYPE
declaration and other features.
The serialized document is parsed with an XML parser. The implementation may seralize the document "in memory" or on disk at its discretion. If assert-valid
is true, a validating parser must be used. If assert-valid
is false, a non-validating parser must be used.
It is an error to provide doctype-public
without doctype-system
. It is an error to specify that assert-valid
is true without providing a doctype-system
value.
I'm not sure what to do about the base URI if the input doesn't have one. I guess you just get errors if any of the system ID references are not absolute.
This is probably incomplete, but it's a start.
If assert-valid is true, a validating parser must be used. If assert-valid is false, a non-validating parser must be used.
I do not think this is consistent with the other validation steps. To my understanding "assert-valid" = false with the other steps means: Validate the document, but do not raise an error. So it is the authors duty to check the "report"-port. For p:validate-with-dtd setting "assert-valid" to false would mean: Do not validate.
If I got this right: I would prefer to have "assert-valid" aligned with the other validation steps.
A (DTD) validating parser treats validity errors as fatal (IIRC). That means you don't get a complete parse if validation fails. In the interest of alignment, I guess we could say that if validation fails, the document is re-parsed with a non-validating parser and the errors returned by the first parsing attempt are reported if assert-valid
is false.
@ndw Thanks. I did not know that the parse is not continued. Knowning this, your suggestion completely make sense!
Which suggestion? That assert-valid
means something slightly different, or that the step should parse twice?
Oh, sorry. The original one makes now sense to me. I would prefer the "parse twice" suggestion a bit more.
I think we can justify parsing twice. The only reason to call this step is to perform DTD validation and the only way to get a report and a document is to parse twice. Given that the implementation must have arranged for there to be a sequence of characters to parse, parsing them twice doesn't seem unduly burdonsome. And if that's a performance problem, the user has explicitly asked for it.
Hey, shouldn't this be moved to the VNext stuff?
We don't have a VNext repository for steps and we already have a bunch of step proposals that are incomplete, so I don't see the harm in adding new ones here.
We don't have a VNext repository for steps and we already have a bunch of step proposals that are incomplete, so I don't see the harm in adding new ones here.
Well, it's not really important, but it's also not entirely true. The VNext repo was setup for all things that might be interesting for a new version, including steps. This was explicitly mentioned in the README. See https://github.com/xproc/VNext
My oversight then. I don't mind moving it if you'd prefer.
Proposal: add a doctype
port on which a single text document may occur. If provided, it is the doctype for the input. If doctype
is provied on a port, then doctype-public
, doctype-system
, and entities
are forbidden.
Remove the doctype-system
and doctype-public
options, require them to be in the serialization options.
Resolved by #579
As a XProc developer I want native and performant DTD support for validation so that I don't need to find a workaround by myself by implementig a with temporary storing, loading (and validating) and deleting my result
Acceptance criteria