xproc / 3.0-specification

A community-driven effort to define an XProc 3.0 specification (formerly 1.1)
http://spec.xproc.org/

Clarification of @collection #555

Closed xatapult closed 5 years ago

xatapult commented 6 years ago

I think I misinterpreted the meaning of @collection on p:variable and p:with-option. The spec's description is currently rather sparse and IMHO needs some clarification:

All of the documents that appear on the connection for the p:variable will be available as the default collection within select expression.

I'll try to write some more prose on this if somebody can explain what is meant and/or send me a simple code example?

gimsieke commented 6 years ago

Example: <p:variable name="count" as="xs:integer" collection="true" select="count(collection())"/> ⇒ 3, if 3 documents are on the DRP.
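A fuller sketch of the same idea, assuming the XProc 3.0 syntax as it was eventually published. The three file names are hypothetical, and the final p:identity exists only to make the value of $count visible on the result port:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <p:output port="result"/>

  <!-- Put three documents on the default readable port (hypothetical files) -->
  <p:identity>
    <p:with-input port="source">
      <p:document href="doc1.xml"/>
      <p:document href="doc2.xml"/>
      <p:document href="doc3.xml"/>
    </p:with-input>
  </p:identity>

  <!-- With collection="true", all documents on the DRP form the default collection -->
  <p:variable name="count" as="xs:integer" collection="true"
              select="count(collection())"/>

  <!-- Report the count; the text value template {$count} is expanded in p:inline -->
  <p:identity>
    <p:with-input>
      <p:inline><count>{$count}</count></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```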

xatapult commented 6 years ago

Ok. Aha. So this also means that you can access the n-th document by writing collection()[n], that's nice.

But what happens when non-XML documents are on the DRP?

gimsieke commented 6 years ago

I think non-XML documents will just be an empty document node (citation needed), or a text node within a document node for text documents.

gimsieke commented 6 years ago

Here’s a more elaborate definition of the context item: http://spec.xproc.org/master/head/xproc/#err.inline.D0008

JSON documents are represented by their XDM representation, that is, an array or a map? So if collection()[2] is of content type text/json, it is not a text node wrapped in a document node, but instead an array or map?

Binary documents are implementation defined. So contrary to what I believed, they are not necessarily represented as a document node. However, document-properties(collection()[3]) should return the document property map if the third document is a binary file. You should be able to do the following, irrespective of the representation:

<p:variable name="binary-doc" as="???" collection="true" select="collection()[3]"/>
<p:store href="myfile.bin">
  <p:with-input port="source" select="$binary-doc">
    <p:empty/>
  </p:with-input>
</p:store>

Should you be able to give a document in a select attribute? The context for p:store here is still the DRP with 3 documents on it. So could we also say, if collection were allowed on p:with-input, <p:with-input port="source" select="collection()[3]" collection="true"/>?

What do we (interoperably) specify as the as attribute value if the variable is supposed to hold a binary document?

xatapult commented 6 years ago

As always, complications...

May I propose the following:

  1. If the document is an XML document everything is hunky dory
  2. If the document is a text document there is only a single text node as child of the document node
  3. If the document is anything else (JSON also), there are no children underneath the document node. (Since you can't represent a map or array in a node tree, this must apply to JSON documents also; there should be some other means to get to its map/array representation.)

ndw commented 6 years ago

<aside>I don't think collection()[2] is going to do what you want; the order of documents in the collection may not be stable.</aside>

The question of what to do with maps is an interesting one. We want JSON to be able to flow through the pipeline. We want to represent JSON as XDM maps. XDM maps aren't nodes. So I think we've just painted ourselves into a corner that says what flows between steps are XDM instances not documents. Bah, humbug.

Non-node values can't go into collections so either we have to serialize them and make them nodes or we have to leave them out of collections. Bah, double humbug.

xml-project commented 6 years ago

Non-node values can't go into collections so either we have to serialize them and make them nodes or we have to leave them out of collections.

Why not? The XPath 3.1 specification says:

Default collection. This is the sequence of items that would result from calling the fn:collection function with no arguments.

So in my reading, any instance of item (document nodes, text nodes etc, and maps) can be part of the default collection. What did I miss?

So I think we've just painted ourselves into a corner that says what flows between steps are XDM instances not documents.

Yes, we actually use "document" in a double sense. This is why I introduced the term "XProc document" in my London paper in June: what flows between steps in XProc is an (XProc) document.
XProc documents are pairs of properties and representations. A representation may be an XDM document or a map.

xml-project commented 6 years ago

@gimsieke

What do we (interoperably) specify as the as attribute value if the variable is supposed to hold a binary document?

Answer: item()*

ndw commented 6 years ago

Sorry. My bad. I was looking at the XPath 3.0 functions and operators spec where fn:collection() returns node()*. I see that in 3.1 it returns item()*. Ignore that bit.

xatapult commented 6 years ago

Ok, looks fine. So summarizing:

  1. If the document is an XML document it's a normal document node tree
  2. If the document is a text document there is only a single text node as child of the document node
  3. If the document is JSON you get a map or array
  4. If the document is binary you get item()*, unspecified, implementation defined

Ok. I'm unsure about 4. @xml-project, is that what you meant?

@ndw We'll have to say something about the order of documents. But why wouldn't that be stable? Documents flow in a certain order, right?

xml-project commented 6 years ago

@eriksiegel

Ok. I'm unsure about 4. @xml-project, is that what you meant?

Yes. You will get what you get, because we define the behavior of binary documents only on the XProc level, not on the XPath level where we are now.

I think your conclusion for JSON is not quite right: For documents with content-type application/json we decided to use fn:parse-json() and I think this is also true for collection(). The function specs say:

JSON-object -> Map
JSON-array -> Array
JSON-string -> xs:string
JSON-number -> xs:double
JSON-boolean -> xs:boolean
JSON-null -> EMPTY-Sequence

So IMHO the correct answer (expressed as SequenceType) for JSON is item()?.
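As a hedged illustration of that sequence type, the sketch below declares a variable as item()? and picks the JSON document out of the default collection by its XDM type rather than by position. It assumes the published XProc 3.0 syntax, assumes the processor represents the parsed JSON as a map or array inside the collection, and uses hypothetical file names. A JSON document whose top level is a string, number, boolean, or null would not match this filter:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <!-- One XML and one JSON document on the default readable port (hypothetical files) -->
  <p:identity>
    <p:with-input port="source">
      <p:document href="doc1.xml"/>
      <p:document href="doc2.json"/>
    </p:with-input>
  </p:identity>

  <!-- item()? allows a map, an array, an atomic value, or the empty sequence.
       The filter keeps only maps and arrays; more than one match would
       violate the declared type. -->
  <p:variable name="json" as="item()?" collection="true"
              select="collection()[. instance of map(*) or . instance of array(*)]"/>

  <!-- Make the outcome visible on the result port -->
  <p:identity>
    <p:with-input>
      <p:inline><is-map>{$json instance of map(*)}</is-map></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```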

gimsieke commented 6 years ago

About order:

In

<p:identity>
  <p:with-input port="source">
    <p:document href="doc1.xml"/>
    <p:document href="doc2.json"/>
    <p:document href="image.png"/>
  </p:with-input>
</p:identity>
<p:variable name="png" select="collection()[3]" collection="true"/>

$png is guaranteed to contain the image.png document. This is stated in the note that immediately precedes http://spec.xproc.org/master/head/xproc/#documentation.

Order would not be guaranteed if you connect to the secondary port of a p:xslt step and, for ex., expect the text document to be the first output document on this port, see https://github.com/xproc/1.0-specification/issues/17

xml-project commented 6 years ago

@gimsieke Sorry, but I thought we were talking about the order in which the XPath-function collection() returns the documents, not about the order on an XProc port (the passage you have quoted).

I agree with @ndw that the spec of the XPath function collection() does not define an order for the sequence, so you can NOT be sure that image.png is returned by collection()[3].

I think that is why XPath has the function fn:collection(arg as xs:string?) (arg is interpreted as a URI), and the function will return the document (in the default collection) with this URI (if any).

gimsieke commented 6 years ago

Implementations should be required to let collection() return the documents in the order in which they appear on the port. Is there a reason not to stipulate this?

xml-project commented 6 years ago

@gimsieke

Is there a reason not to stipulate this?

We are not in a position to stipulate this, because we are not the XPath next community group. collection() is an XPath function defined in their specs. How can we change their specs?

gimsieke commented 6 years ago

I don’t see anything in https://www.w3.org/TR/xpath-functions-31/#func-collection that would prevent us from returning the default collection in a specific order.

xml-project commented 6 years ago

Sorry @gimsieke, I failed to make my point: We (which means in this case the XProc implementors) do not return anything here. We call an XPath processor to execute the XPath expression containing "fn:collection()". And the XPath processor evaluates the expression according to the XPath specs. And since the specs do not guarantee order, there might be order or not.

I do not see what we (the XProc next community group) could do about this.

gimsieke commented 6 years ago

Saxon for example has no built-in default collection. If I read this code correctly, @ndw constructs a default collection that he passes to net.sf.saxon.lib.CollectionURIResolver. This is for XSLT. For XProc 3.0 constructs that accept @collection, I assume that Norm will continue to use Saxon as the XPath processor. For these XPath expressions (outside of XSLT), you have your own XPath processor. What prevents you from defining the default collection in a specific way?

xml-project commented 6 years ago

  1. As far as I remember "CollectionURIResolver" is deprecated since 9.7. I looked up the APIs yesterday to see whether there is any information, but there is none. There is a new interface "CollectionFinder", but there is also no hint about order (and stability).

  2. I do not think the Saxon API can count as an argument, because we are not building "XProc on Saxon".

  3. I do not think the problem is worth the whole discussion because you could easily use p:split-sequence to solve the problem. So IMHO there is no need to deviate from XPath standards or tie our specification to a specific XPath processor.
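A minimal sketch of the p:split-sequence approach, assuming the step signature from the published XProc 3.0 step library (a test option evaluated once per document, with matching documents on the primary matched port). It selects by position in the port sequence, not by position in the default collection, and reuses the file names from the example earlier in this thread:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result" sequence="true"/>

  <!-- The documents appear on the port in the order they are listed here -->
  <p:identity>
    <p:with-input port="source">
      <p:document href="doc1.xml"/>
      <p:document href="doc2.json"/>
      <p:document href="image.png"/>
    </p:with-input>
  </p:identity>

  <!-- Only the third document in the sequence is passed to the matched port -->
  <p:split-sequence test="position() = 3"/>
</p:declare-step>
```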

ndw commented 6 years ago

I have no reason to believe that the collection() function returns the documents in the same order that I passed them to the collection URI resolver (or whatever the new interface is).

It's called collection not sequence because it's an unordered collection, I believe.

gimsieke commented 6 years ago

I am not suggesting to tie XProc to a specific XPath processor. I am just proposing that each implementation be required to return the default collection in the order that the documents that appear on the corresponding port already have. In certain circumstances, the order in which they appear is already specified by the XProc spec.

gimsieke commented 6 years ago

And I’m asserting that this does not deviate from the XPath spec.

ndw commented 6 years ago

That is not within my control. I pass a bunch of documents off to Saxon to put in a collection. I don't know how Saxon keeps track of those. Maybe Michael puts them in a map and the insertion-order is lost. Maybe he doesn't. Whether or not they come back in the order I added them is at best implementation-dependent.

xml-project commented 6 years ago

I think @ndw's comment should be the bottom line of the "order" discussion, gentlemen.

gimsieke commented 6 years ago

Ok. Returning to @eriksiegel’s comment, maybe we should add a note to the default collection. Something like: “A specific XProc processor in a specific version might return collection items in a certain order, and maybe it is the order that the items appeared on a port. However, you should not rely on accessing collection items by position (for example, collection()[3]). Use other criteria, such as base URIs and other document properties, top-level element names or namespaces, or map keys in order to select specific items from a collection.”
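What such position-independent selection could look like is sketched below, under several assumptions: the published XProc 3.0 syntax, a p:document-property() function that can be called on every item in the collection (including whatever implementation-defined representation a binary document has), and hypothetical file names and a hypothetical chapter root element:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <p:identity>
    <p:with-input port="source">
      <p:document href="doc1.xml"/>
      <p:document href="doc2.json"/>
      <p:document href="image.png"/>
    </p:with-input>
  </p:identity>

  <!-- Select by document property: the item whose base URI ends in /image.png -->
  <p:variable name="png" as="item()*" collection="true"
              select="collection()[ends-with(string(p:document-property(., 'base-uri')), '/image.png')]"/>

  <!-- Select by top-level element name: XML documents whose root element is chapter -->
  <p:variable name="chapters" as="document-node()*" collection="true"
              select="collection()[. instance of document-node(element(chapter))]"/>

  <!-- Report how many items each selection found -->
  <p:identity>
    <p:with-input>
      <p:inline><found png="{count($png)}" chapters="{count($chapters)}"/></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```

Neither selection depends on the order in which collection() returns its items.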

xatapult commented 5 years ago

Fine with me. I'll add some more prose to this to clarify.

ndw commented 5 years ago

@eriksiegel proposes that #565 also fixes this. I'm happy with that.