xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps
10 stars 7 forks source link

p:split-sequence for non-xml documents #362

Closed xml-project closed 4 years ago

xml-project commented 4 years ago

While twiddling with p:unarchive on a word document yesterday, I was quite surprised that the following does not work:

<p:unarchive>
  <p:with-input href="some-document" />
</p:unarchive>

<p:split-sequence test="ends-with(p:document-property(., 'base-uri'), 'word/document.xml')" />

I expected this to work and I do think, it would be quite useful to split sequences on the base of document properties. However the port sourceon p:split-sequence is defined as "xml html". So it does not work if there are other documents on the port.

Is there any reason I do not see why we need the restriction on the input port? Would it not make sense to have content-type="any" for the input and the output ports?

gimsieke commented 4 years ago

I don’t see a reason to restrict it to xml html, either.

xml-project commented 4 years ago

@ndw @xatapult Objections?

ndw commented 4 years ago

No objections from me. What is the consequence of attempting to evaluate an XPath expression on a non-XML document in this case? Suppose the test is count(//*) gt 3 for example? Is the step required to raise an error (a particular error?) or does the test simply return false?

gimsieke commented 4 years ago

For JSON documents (represented as maps, for example), a static error “XPTY0020: the context item is not a node” will be raised if you try to refer to the context, at least this is what Saxon does.

For other documents, we say that they are “empty XDM documents”. What does that mean? If they are represented by an otherwise empty document node, //* will return the empty sequence.

xml-project commented 4 years ago

Right: For count(//*) gt 3 you will get an error, if it is a JSON document. But this is a special (bad?) example because there are a lot of XPath expressions making sense here.

I think we do not need to change anything in the inner logic of p:split-sequence: The prose says thats the document is taken as the context item and the decision is made based on the effective boolean value of the test-expression. Pipeline authors therefore could use any XPath expression they like. Some, like the one @ndw mentioned, will raise an error if I carelessly suppose that all documents will have a document node. So the right way for the count-example would probably be:

<p:split-sequence test="if (. instance of node()) then count(//*)=3 else false()" />
xatapult commented 4 years ago

No objections (at all). On the contrary: all for any

xml-project commented 4 years ago

After playing a bit with the XSLT 3.0 rule, that an error in a pattern is a non-match, I have to say, I realy like it. Therefor I changed my mind and would propose, that the XPath expression in p:split-sequence is applied to the documents in turn: If the result is an effective-boolean-true, the document appears on the matchport, if it is an effective-boolean-false or an error is raised, the document appears on the non-match port. Does this make sense to anyone? Do we need an additional attribute fail-on-error (false by default), that make the step explicitly raise an error instead of sorting the document into the non-matched sequence? I think this is a no-brainer but I wonder whether the additional complexity is worth it. Thoughts?

gimsieke commented 4 years ago

I’m a bit hesitant because in all other cases where we match by XPathExpression or XSLTSelectionPattern (for example, p:add-attribute/@match), we don’t treat errors as non-match but as errors.

And I don’t think we should extend the XSLT rule that an error in a pattern (or in an expression, in our case here) is a non-match to all other XPathExpression and XSLTSelectionPattern attributes. (I’m not implying that you suggested this, I’m just pondering whether this is a rule that should be generalized.)

Apart from split-sequence, it might make sense to treat errors as false() in p:wrap-sequence/@group-adjacent so that two adjacent documents with either erroneous or false()-returning expressions will be grouped (because deep-equal(false(),false())=true()).

If we do so for both p:split-sequence and p:wrap-sequence, a fail-on-error option might be useful. If you as an implementor are willing to accept the additional complexity, that is.

xml-project commented 4 years ago

To my understanding all XSLTSelectionPattern are defined by XSLT 3.0, so an error is a non-match by definition. Is there are ruling in XProc saying the opposite? Apart from that: Expanding it to XPathExpressions is a point where I am also not sure, we want to do this. Pro: Shorter XPath-Expressions because I do not need a if () to prevent an error (if I do not want one). Con: It would be OUR rule and might lead to confusion.

Don't know. May be we should call it all of and leave p:split-sequence XML/HTML only. If you want to split a mixed sequence, you could always use a p:for-each with a p:choose as child.

gimsieke commented 4 years ago

You are right, apart from certain areas such as AVTs or when we refer to context items (as in count(//*)) in conditionals or when declaring variables we don’t say that an error must be thrown if an XPath expression is erroneous. But we shouldn’t leave it to an implementation to decide whether an XPath error is thrown or whether the expression will be tacitly assumed as false() in certain contexts.

This and the original question is maybe something for our next call.

xml-project commented 4 years ago

But we shouldn’t leave it to an implementation to decide whether an XPath error is thrown or whether the expression will be tacitly assumed as false() in certain contexts.

Definitely not!

xml-project commented 4 years ago

Fixed with #399