xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps

First cut at p:validate-with-dtd #579

Closed ndw closed 2 months ago

ndw commented 4 months ago

This is my first attempt. Feedback eagerly solicited. Formatted versions should appear on the xproc.org/dashboard page a few minutes after I create this request.

ndw commented 4 months ago

Close #543

ndw commented 4 months ago

Thank you, Gerrit! You're absolutely right. The step can return the original document unchanged if there was an error. That's much more sensible.

ndw commented 4 months ago

I think we need a document-element option for the case where you send a text document as the source. For example:

<p:identity>
  <p:with-input><p>Paragraph of text.</p></p:with-input>
</p:identity>

<p:validate-with-dtd
  general-entities="map { 'text': 'Hello, world.',
                          'para': . }"
  document-element="doc">
  <p:with-input port="source">
    <p:inline content-type="text/plain"><![CDATA[<doc>
<p>Test</p>
<p>&text;</p>
&para;
</doc>]]></p:inline>
  </p:with-input>
  <p:with-input port="doctype"><p:empty/></p:with-input>
</p:validate-with-dtd>

ndw commented 4 months ago

Having poked at the implementation a bit, I think what I've proposed is way over-the-top. How about:

<p:declare-step type="p:validate-with-dtd">
  <p:input port="source" primary="true" content-types="xml html text"/>
  <p:input port="doctype" content-types="text" sequence="true">
    <p:empty/>
  </p:input>
  <p:output port="result" primary="true" content-types="xml"/>
  <p:output port="report" sequence="true" content-types="xml json"/>
  <p:option name="report-format" select="'xvrl'" as="xs:string"/>
  <p:option name="serialization" as="map(xs:QName,item()*)?"/>
  <p:option name="assert-valid" select="true()" as="xs:boolean"/>
</p:declare-step>
  1. In the simple case, you pass a document with a doctype-system serialization property (on the document or the step). We serialize the document with the necessary doctype declaration and validate it.
  2. You provide a doctype, we serialize the source document (without a doctype declaration or XML declaration), slap the doctype you provided in front of it and validate it.
  3. If you want to do anything funky with entity replacements or some such, you construct the text of the document you want to parse, by whatever means you want, and we validate it.
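
Case 2 could look roughly like the following pipeline. This is a minimal sketch that assumes the step signature proposed above is implemented as written. The inline DTD and the sample document are invented for illustration, and treating the content of the doctype port as a complete DOCTYPE declaration is also an assumption.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <!-- Assumes the proposed p:validate-with-dtd signature from this thread. -->
  <p:validate-with-dtd>
    <!-- The parsed source document; it is serialized without a doctype
         or XML declaration before validation. -->
    <p:with-input port="source">
      <doc><p>Test</p></doc>
    </p:with-input>
    <!-- Hypothetical doctype text that is placed in front of the
         serialized source. -->
    <p:with-input port="doctype">
      <p:inline content-type="text/plain"><![CDATA[<!DOCTYPE doc [
<!ELEMENT doc (p+)>
<!ELEMENT p (#PCDATA)>
]>]]></p:inline>
    </p:with-input>
  </p:validate-with-dtd>
</p:declare-step>
```

For case 1, the doctype input would presumably be left at its empty default and the step would instead be given something like serialization="map{'doctype-system': 'doc.dtd'}", where doc.dtd is a hypothetical system identifier.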

xml-project commented 4 months ago

I have most probably missed something important, but I am confused about what the report port is for. If the validation succeeds, nothing “interesting” is in the documents on this port. If it doesn’t, the report document is not available because a dynamic error is raised. What am I missing?

ndw commented 4 months ago

Several comments back, @gimsieke persuaded me that we should put the assert-valid option back and just pass the original document through if assert-valid is false() and an error occurs.
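
A sketch of how that might be used, assuming the proposed signature: with assert-valid set to false, the pipeline below exposes both the passed-through document and whatever appears on the report port. The step name "validate" and the system identifier doc.dtd are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" primary="true"/>
  <!-- Read the validation report from the step's non-primary port. -->
  <p:output port="report" sequence="true" pipe="report@validate"/>

  <!-- assert-valid="false": an invalid source should not raise a dynamic
       error; the original document is passed through on "result". -->
  <p:validate-with-dtd name="validate" assert-valid="false"
                       serialization="map{'doctype-system': 'doc.dtd'}"/>
</p:declare-step>
```

With assert-valid left at its default of true, the same pipeline would be expected to fail with a dynamic error on invalid input, so the report would never be read.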

xml-project commented 4 months ago

@ndw thanks. Now I know what I missed. :-))

xml-project commented 4 months ago

@ndw Two questions came up while I was trying to implement the new suggestion:

<p:declare-step type="p:validate-with-dtd">
  <p:input port="source" primary="true" content-types="xml html text"/>
  <p:input port="doctype" content-types="text" sequence="true">
    <p:empty/>
  </p:input>
  <p:output port="result" primary="true" content-types="xml"/>
  <p:output port="report" sequence="true" content-types="xml json"/>
  <p:option name="report-format" select="'xvrl'" as="xs:string"/>
  <p:option name="serialization" as="map(xs:QName,item()*)?"/>
  <p:option name="assert-valid" select="true()" as="xs:boolean"/>
</p:declare-step>

Please excuse these questions if they are stupid, but I am not a DTD expert.
1. What is supposed to happen if a text document appears on port "source"?
2. If an HTML document appears on port "source", is the result type "xml" correct?

ndw commented 4 months ago

A text document is allowed so that you could construct something like this:

<doc>
  &chap1;
  &chap2;
</doc>

where presumably the chap1 and chap2 entities are defined in the doctype. There's no way to get unexpanded entities into a parsed XDM, so you'd have to do it this way. I haven't thought very hard about how difficult it will be to make a text document that serializes correctly!
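
As a rough illustration, such a text document might be passed to the step like this. It is only a sketch under the proposed signature: the DTD, the entity values, and the choice to put the DOCTYPE declaration inside the text document itself (rather than on the doctype port) are all assumptions. The CDATA section keeps the pipeline's own XML parser from trying to expand the entity references.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <!-- Assumes the proposed p:validate-with-dtd signature from this thread. -->
  <p:validate-with-dtd>
    <p:with-input port="source">
      <!-- A text document: the entity references stay unexpanded until
           the step parses and validates the text. -->
      <p:inline content-type="text/plain"><![CDATA[<!DOCTYPE doc [
<!ELEMENT doc (chap+)>
<!ELEMENT chap (#PCDATA)>
<!ENTITY chap1 "<chap>One</chap>">
<!ENTITY chap2 "<chap>Two</chap>">
]>
<doc>
  &chap1;
  &chap2;
</doc>]]></p:inline>
    </p:with-input>
  </p:validate-with-dtd>
</p:declare-step>
```

If validation succeeds, the result port would presumably carry the parsed XML document with both entities expanded.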

DTD validation sort-of implies XML, so I think making the result always be XML makes sense. If you think it makes more sense to give a document with a root element of (X)HTML an HTML content type, I can see how that might make sense too.

xml-project commented 4 months ago

@ndw Thank you!

ndw commented 2 months ago

Hi folks. I've pushed an update that simplifies the p:validate-with-dtd step along the lines that I described in a comment above.