xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps

Output of empty p:unwrap #470

Closed xatapult closed 3 years ago

xatapult commented 4 years ago

Recently I had the situation where a p:unwrap resulted in nothing: <p:unwrap match="/*"> on a document with just an empty root element like <some-root-element/>.

To my surprise (before reading the specs), in Morgana this resulted in an empty document node or (I'm not sure) a document with an empty (?) text node. I suppose this is correct, although the spec does not make it explicit.

Can we make this situation more explicit and turn it into something that makes more sense? I suggest changing the signature of the output port to sequence="true" and outputting the empty sequence in a case like the one described above.

gimsieke commented 4 years ago

Hmm, not sure whether such a fringe case justifies changing the result to sequence="true".

Did I understand correctly: the only case that you think is not clearly specified is when the result does not contain a text node at all?

I think then the result should be a document node with no children. Empty text nodes are not allowed in a parent node. But a document node may well be empty.

I’d read the spec as saying that the content type of this document-node-only document is still application/xml, or whatever the input document’s content type was.

xatapult commented 4 years ago

Maybe you're right. Probably it's just a matter of explicitly mentioning and specifying this fringe case in the specs.

gimsieke commented 4 years ago

And probably Morgana is right, too. I don’t have it installed, but can you inspect more closely the content type and count(/node()) of the unwrapped document? It should be application/xml and 0, respectively.
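The check suggested above can be illustrated with a small model. This is only a sketch in Python, not XProc and not how Morgana works: the `Node` class and the `unwrap_root` helper are hypothetical names introduced here, and the list passed in stands for the children of the document node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "element", "text", "comment" or "pi"
    value: str = ""                # element name, text, or comment/PI content
    children: list = field(default_factory=list)


def unwrap_root(top_level):
    """Replace every top-level element with its children, keeping other nodes.

    Loosely mirrors <p:unwrap match="/*"> applied to a document node's children.
    """
    result = []
    for node in top_level:
        if node.kind == "element":
            result.extend(node.children)
        else:
            result.append(node)
    return result


# <some-root-element/> as the only child of the document node
document_children = [Node("element", "some-root-element")]
unwrapped = unwrap_root(document_children)
print(len(unwrapped))  # rough analogue of count(/node()); prints 0
```

In this model the unwrapped document simply has no children, which matches the expected count of 0. The model says nothing about the content type.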

xatapult commented 3 years ago

Unwrapping an empty element becomes a text document without contents.

ndw commented 3 years ago

...per the editorial team meeting on 22 October 2020

xatapult commented 3 years ago

I think I've opened a little can of worms with this. Even after the changes I already made, I think there are still some things left that should be specified or at least mentioned:

Gerrit mentions here that the serialization attribute will be removed when the content type changes. However, there is no mention of that in the step's description. And I'm wondering: is that necessary? Why not simply retain it? Any serialization options that are no longer relevant are simply ignored, right? If we do want to remove it, that should be mentioned in the step's description.

I was also wondering what would be the outcome of <p:unwrap match="/*"> of:

<?Some processing instruction(s)?>
<!-- and/or some comment(s) -->
<an-empty-root-element/>

I think: a document node with the comments/processing instructions as children and the content type unchanged (not changed to text/plain). So it would not be well-formed XML (which we already agreed is OK).

The same is true in the example above when the root element just contains some text: The content-type will not change then.

gimsieke commented 3 years ago

The serialization property will be removed because it is specified in § 3.1:

If a step changes the content-type in this way, it must also remove the serialization property.

The code example: a document node with a PI, a comment, and probably also a whitespace-only text node after each (two in total). Unless some standard says that whitespace must be ignored outside of a top-level element, but I don’t think this is the case. Content-type unchanged, yes.

If the top-level element contained a text node, it will be merged with the whitespace nodes I guess. In any case, if there is a comment and/or PI, the document will retain its XML content type.
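The merging mentioned above can be sketched as follows. This is a minimal Python illustration, assuming (as this comment does) that whitespace-only text nodes exist between the top-level nodes; whether they really do depends on the parser and the data model. The `merge_adjacent_text` helper and the `(kind, value)` tuples are hypothetical and not part of any XProc API.

```python
def merge_adjacent_text(nodes):
    """Merge neighbouring text nodes and drop empty ones.

    `nodes` is a list of (kind, value) tuples in document order.
    """
    merged = []
    for kind, value in nodes:
        if kind == "text" and merged and merged[-1][0] == "text":
            # Two text nodes may not be siblings, so concatenate them.
            merged[-1] = ("text", merged[-1][1] + value)
        elif kind == "text" and value == "":
            # Empty text nodes are not allowed in a parent node.
            continue
        else:
            merged.append((kind, value))
    return merged


# Children of the document node after unwrapping a root element that
# contained only the text "hello" (whitespace nodes are an assumption).
after_unwrap = [
    ("pi", "Some processing instruction(s)"),
    ("text", "\n"),
    ("comment", " and/or some comment(s) "),
    ("text", "\n"),
    ("text", "hello"),
]
result = merge_adjacent_text(after_unwrap)
print([kind for kind, _ in result])  # ['pi', 'text', 'comment', 'text']
```

The last two text nodes end up as a single text node, and the PI and the comment are still present in the result.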

xatapult commented 3 years ago

Thanks @gimsieke, missed that line in 3.1.

Glad we're in agreement about the other things 😉

ndw commented 3 years ago

I'm not sure exactly how to word it off the top of my head, but if there are comments or PIs then it has to remain an XML document. It only becomes a text document if the result of unwrapping is a single text node.
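One way to read the rule as discussed in this thread, written as a small Python sketch rather than as spec wording. The function name and the plain dict of document properties are hypothetical. Treating an empty result as a text document follows the "text document without contents" comment above, and dropping the serialization property follows the sentence quoted from § 3.1.

```python
def result_properties(child_kinds, input_properties):
    """Return the document properties of an unwrapped result.

    `child_kinds` lists the kinds of the document node's children after
    unwrapping, e.g. ["pi", "comment"] or ["text"].
    """
    props = dict(input_properties)
    # Any comment, PI or element keeps the document an XML document.
    if any(kind != "text" for kind in child_kinds):
        return props
    # Only text, or nothing at all: the result is a text document.
    if props.get("content-type") != "text/plain":
        props["content-type"] = "text/plain"
        # The content type changed, so the serialization property goes too.
        props.pop("serialization", None)
    return props


xml_props = {"content-type": "application/xml", "serialization": {"indent": "true"}}
print(result_properties(["pi", "comment"], xml_props)["content-type"])  # application/xml
print(result_properties(["text"], xml_props))  # {'content-type': 'text/plain'}
```

The input properties are copied rather than modified, so the same dict can be reused for several cases.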

xatapult commented 3 years ago

@ndw Yeah. And since it already was an XML document to begin with, we can leave the content-type unchanged.