Filtering input documents

rdeltour commented 9 years ago

(sorry to hijack the thread on public-xml-processing-model-wg list by creating an issue here, but that was the easiest way I found to comment).

I had a quick look at @ndw's proposal for filtering documents in ndw/specification@f4e6b74a9ee51539bbb7b5684625a62879c8c973. Some comments:

Since p:filter is already the name of a step in the standard library, there's a risk of confusion in using it in this context (the 1.0 step is a bit of a false friend IMO: I always find myself trying to use it to filter a sequence of documents when it isn't capable of that, but it's probably too late to step back on this naming).
FWIW I agree with @xquery's comment that having it as a child of p:input introduces an inconsistency; there's also a risk of confusion as to where it is inserted (e.g. does the p:filter apply only to documents produced by preceding sibling connectors, or to all of them?)
Why not instead use the input/@select attribute for filtering input documents? It would be possible if the sequence of connected documents were available in the default collection of the XPath context, in a similar fashion to what is proposed in #137.

Instead of what is currently proposed with a p:filter:

<p:input port="source">
  <p:pipe step="someSource" port="result"/>
  <p:filter include="contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')"/>
</p:input>

You'd have:

<p:input port="source" select="collection()[contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')]">
  <p:pipe step="someSource" port="result"/>
</p:input>

I'm surely overlooking things, but wanted to jot that down while it's fresh...

ndw commented 9 years ago

[ Not sure how best to manage the split conversation. Here's what I sent in reply to Jim ]

Yes, this proposal definitely mixes things in p:input. And p:filter is probably a poor choice of name given that we already have a p:filter step. So, imagine we'll call it something else eventually.

I was being lazy about the content model; for the sake of Romain's question about ordering, let's force all the filters to be at the end of the content model.

The filter element is a child of p:input because there's nowhere else to put it, really. The idea is that it applies (they apply) to the sequence of documents appearing on that input port.

Now that we have non-XML documents in the pipeline, I can imagine that there will be more streams of "mixed" documents (contents of ZIP files, contents of directories globbed, etc.). Some steps will want to process only the images, some only the XML, etc. Rather than having to filter the streams in separate steps, my thinking is that this simplifies a common case.

It was inspired by Ant and Gradle features that allow you to grab a bunch of files, because that simplifies the selection, and then explicitly exclude some. I suppose exclude is all you really need, but I liked the parallelism of include/exclude.
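
For illustration, the exclude side of that pairing might look roughly like this. It is only a sketch: it assumes an exclude attribute that mirrors the include attribute shown earlier in the thread, and the 'image' test is a made-up example rather than anything from the draft.

```xml
<!-- Hypothetical: grab everything from someSource, then drop the images. -->
<p:input port="source">
  <p:pipe step="someSource" port="result"/>
  <!-- exclude is assumed to take the same kind of XPath expression as include -->
  <p:filter exclude="contains(
                        map:get(p:document-properties(.),'content-type'),
                        'image')"/>
</p:input>
```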

Using the select attribute on p:input only solves the very simple case. But maybe that's enough.

(The case it doesn't handle is when you want to use @select to process some interior portion of the documents because then you can't (easily) make @select do both.)

ndw commented 9 years ago

Actually, I don't think this can work at all:

<p:input port="source" select="collection()[contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')]">

The select expression on p:input applies to each document in turn. The collection() function isn't meaningfully defined.

I suppose select=".[contains(...)]/expr" would work, but it's a little subtle.
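
Spelled out, that form might look something like the following. This is a sketch under assumptions: //html:div stands in for the trailing expression, the html prefix is assumed to be bound, and nothing here has been checked against an implementation.

```xml
<!-- The predicate on "." is evaluated once per document; a document that fails it
     contributes nothing, a document that passes has its html:div elements selected. -->
<p:input port="source" select=".[contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')]//html:div">
  <p:pipe step="someSource" port="result"/>
</p:input>
```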

rdeltour commented 9 years ago

> [ Not sure how best to manage the split conversation. Here's what I sent in reply to Jim ]

(yeah, sorry again for this duplication. The list is not open to public posting though, so I'll keep using the tracker.)

> Using the select attribute on p:input only solves the very simple case. But maybe that's enough.
>
> (The case it doesn't handle is when you want to use @select to process some interior portion of the documents because then you can't (easily) make @select do both.)

I'm curious to see concrete examples of where it would be limited. IMO using @select is more powerful than either-or; it can do both because XPath can (it's a matter of applying a predicate to the collection sequence and then selecting nodes for each item in the sequence).

filtering documents

<p:input port="source" select="collection()[contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')]">
  <p:pipe step="someSource" port="result"/>
</p:input>

selecting elements of the document(s)

<p:input port="source" select="//html:div">
  <p:pipe step="someSource" port="result"/>
</p:input>

which would be specified as being equivalent to

<p:input port="source" select="collection()//html:div">
  <p:pipe step="someSource" port="result"/>
</p:input>

selecting elements from filtered documents

<p:input port="source" select="collection()[f:my-filter-expression()]//html:div">
  <p:pipe step="someSource" port="result"/>
</p:input>

As before, there would be rules to specify what kind of result sequence is allowed, how nodes are wrapped in documents, etc.

vojtechtoman commented 9 years ago

Instead of thinking about "filter" as a new child element of p:input, what about considering it a new type of binding/connection, in addition to p:document, p:pipe, p:data etc.? For instance, if it is defined as follows:

<p:filter
  include? = XPathExpression
  exclude? = XPathExpression>
      (p:document |
       p:inline |
       p:pipe |
       p:data |
       p:filter)+
</p:filter>

then it can be a very transparent feature with the added advantage that you could use it anywhere where you can use the other bindings.
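
As a rough sketch of how that might read in a pipeline, assuming the content model above and reusing the content-type test from earlier in the thread (the step and port names are the same placeholders used there):

```xml
<!-- Hypothetical: the filter wraps the connection instead of sitting beside it,
     so it is unambiguous which documents it applies to. -->
<p:input port="source">
  <p:filter include="contains(
                        map:get(p:document-properties(.),'content-type'),
                        'xml')">
    <p:pipe step="someSource" port="result"/>
  </p:filter>
</p:input>
```

Since p:filter is itself allowed inside p:filter in that content model, an include and an exclude could presumably be layered by nesting.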

rdeltour commented 9 years ago

(oops, I hadn't seen that you replied in between)

> The select expression on p:input applies to each document in turn.

Right, although the v2 spec might be able to change that?

> The collection() function isn't meaningfully defined.

The idea was to define this default collection of the XPath context, consistent with what is proposed in #137.

ndw commented 9 years ago

At the 10 June 2015 face-to-face, we determined that the editor's current draft of input filtering was poorly conceived and decided to abandon it.

xproc / 1.0-specification

Filtering input documents #154