xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps
10 stars 7 forks source link

add <p:validate-with-dtd> #543

Closed Markup-Fanatic closed 2 months ago

Markup-Fanatic commented 1 year ago

As a XProc developer I want native and performant DTD support for validation so that I don't need to find a workaround by myself by implementig a with temporary storing, loading (and validating) and deleting my result

Acceptance criteria

ndw commented 1 year ago

Right. Okay. Achim and I chatted about this the other day.

DTD validation is by definition something that has to be performed on a sequence of characters. You can't "DTD validate" an XML document that's already been parsed. That's incoherent.

Here's a back-of-the-envelope design:

<p:declare-step type="p:validate-with-dtd">
     <p:input port="source" primary="true" content-types="xml html text"/>
     <p:output port="result" primary="true" content-types="xml"/>
     <p:output port="report" sequence="true" content-types="xml json"/>
     <p:option name="doctype-public" select="()" as="xs:string?"/>
     <p:option name="doctype-system" as="()" as="xs:string?"/>
     <p:option name="assert-valid" select="true()" as="xs:boolean"/>
     <p:option name="entities" as="map(xs:string,item())?"/>     
     <p:option name="report-format" select="'xvrl'" as="xs:string"/>
     <p:option name="serialization" as="map(xs:QName,item()*)?"/>
</p:declare-step>

If entities is non-empty, each entity must be either a string or a map containing a doctype-system key and an optional doctype-public key.

If the source is XML or HTML, it's serialized with a <!DOCTYPE declaration and declarations for the declared entities in the internal subset. If the source is text, it's the callers responsibility to format the text with a <!DOCTYPE declaration and other features.

The serialized document is parsed with an XML parser. The implementation may seralize the document "in memory" or on disk at its discretion. If assert-valid is true, a validating parser must be used. If assert-valid is false, a non-validating parser must be used.

It is an error to provide doctype-public without doctype-system. It is an error to specify that assert-valid is true without providing a doctype-system value.

I'm not sure what to do about the base URI if the input doesn't have one. I guess you just get errors if any of the system ID references are not absolute.

This is probably incomplete, but it's a start.

xml-project commented 6 months ago

If assert-valid is true, a validating parser must be used. If assert-valid is false, a non-validating parser must be used.

I do not think this is consistent with the other validation steps. To my understanding "assert-valid" = false with the other steps means: Validate the document, but do not raise an error. So it is the authors duty to check the "report"-port. For p:validate-with-dtd setting "assert-valid" to false would mean: Do not validate.

If I got this right: I would prefer to have "assert-valid" aligned with the other validation steps.

ndw commented 6 months ago

A (DTD) validating parser treats validity errors as fatal (IIRC). That means you don't get a complete parse if validation fails. In the interest of alignment, I guess we could say that if validation fails, the document is re-parsed with a non-validating parser and the errors returned by the first parsing attempt are reported if assert-valid is false.

xml-project commented 6 months ago

@ndw Thanks. I did not know that the parse is not continued. Knowning this, your suggestion completely make sense!

ndw commented 6 months ago

Which suggestion? That assert-valid means something slightly different, or that the step should parse twice?

xml-project commented 6 months ago

Oh, sorry. The original one makes now sense to me. I would prefer the "parse twice" suggestion a bit more.

ndw commented 6 months ago

I think we can justify parsing twice. The only reason to call this step is to perform DTD validation and the only way to get a report and a document is to parse twice. Given that the implementation must have arranged for there to be a sequence of characters to parse, parsing them twice doesn't seem unduly burdonsome. And if that's a performance problem, the user has explicitly asked for it.

xatapult commented 6 months ago

Hey, shouldn't this be moved to the VNext stuff?

ndw commented 6 months ago

We don't have a VNext repository for steps and we already have a bunch of step proposals that are incomplete, so I don't see the harm in adding new ones here.

xatapult commented 6 months ago

We don't have a VNext repository for steps and we already have a bunch of step proposals that are incomplete, so I don't see the harm in adding new ones here.

Well, it's not really important, but it's also not entirely true. The VNext repo was setup for all things that might be interesting for a new version, including steps. This was explicitly mentioned in the README. See https://github.com/xproc/VNext

ndw commented 6 months ago

My oversight then. I don't mind moving it if you'd prefer.

ndw commented 4 months ago

Proposal: add a doctype port on which a single text document may occur. If provided, it is the doctype for the input. If doctype is provied on a port, then doctype-public, doctype-system, and entities are forbidden.

Remove the doctype-system and doctype-public options, require them to be in the serialization options.

ndw commented 2 months ago

Resolved by #579