xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps

Question about storing Xdm documents #355

Closed xml-project closed 4 years ago

xml-project commented 4 years ago

As far as I understand the specification, any XDM document will be stored, even if it is not a well-formed XML document (e.g. not exactly one top-level element node, or top-level non-whitespace text nodes). Right?

Would it make sense to add an option to every serializing step (p:store, p:http-request, ...) that makes sure only well-formed documents are serialized? If the document is not well-formed, an error could be raised. Did I miss something?

gimsieke commented 4 years ago

fEaTuRE fReeZE !!!!!1!!!!11!!!!

More seriously, why not?

One question is whether the processor is able to determine in advance each and every case in which non-well-formed output will be written to disk. People might get creative with character maps, and a processor can only determine non-well-formedness after attempting to parse the serialized document again.
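To illustrate the character-map point, here is a minimal Python sketch. It assumes a plain string replacement as a stand-in for the character expansion phase (a real serializer maps only characters in text and attribute values), and the helper names are hypothetical. It shows that the damage only becomes visible when the serialized result is parsed again.

```python
import xml.etree.ElementTree as ET


def serialize_with_character_map(root, character_map):
    """Serialize an element, then apply a character map to the result.

    Simplification (assumption): a plain string replacement stands in for
    the character expansion phase of a real serializer.
    """
    text = ET.tostring(root, encoding="unicode")
    for char, replacement in character_map.items():
        text = text.replace(char, replacement)
    return text


def is_well_formed(serialized):
    """Return True if the serialized string parses as an XML document."""
    try:
        ET.fromstring(serialized)
        return True
    except ET.ParseError:
        return False


root = ET.Element("doc")
root.text = "a \u00a7 b"  # the section sign is the character to be mapped

# Mapping to a character reference keeps the output well-formed.
harmless = serialize_with_character_map(root, {"\u00a7": "&#167;"})
# Mapping to an unclosed tag breaks it, and only a re-parse reveals that.
breaking = serialize_with_character_map(root, {"\u00a7": "<br>"})

print(is_well_formed(harmless), is_well_formed(breaking))
```

The tree itself is the same in both cases, so nothing about the document model hints at the problem before serialization.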

If we only want to check whether a single tree (plus optional whitespace, comment, or PI nodes) is going to be serialized, we could maybe add something like an XProc-specific c:assert-single-tree option to the serialization map.

It may be that such an option already exists. Have a look at the paragraph that precedes this heading in the serialization spec:

It is a serialization error [err:SEPM0004] to specify the doctype-system parameter, or to specify the standalone parameter with a value other than omit, if the instance of the data model contains text nodes or multiple element nodes as children of the root node. The serializer MUST either signal the error, or recover by ignoring the request to output a document type declaration or standalone parameter.

To my understanding, you need to set standalone to any value other than 'omit' in order to raise this error for multiple-tree documents. On the other hand, error SEPM0004 doesn’t allow text nodes at all. I’d think that whitespace-only text nodes are allowed around a top-level element and should be serialized.
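The difference between the two checks can be sketched in Python. The model of top-level children as (kind, value) pairs and both function names are assumptions made for illustration. The first function mirrors the condition in the quoted SEPM0004 paragraph; the second is a hypothetical single-tree check that tolerates whitespace-only text. Note that a serializer is also allowed to recover from SEPM0004 instead of signaling it.

```python
# Top-level children of a document node, modeled as (kind, value) pairs.
# This tuple model is an assumption made for illustration only.


def triggers_sepm0004(children, standalone="omit", doctype_system=None):
    """Condition under which the quoted paragraph defines err:SEPM0004."""
    requested = doctype_system is not None or standalone != "omit"
    element_count = sum(1 for kind, _ in children if kind == "element")
    has_text = any(kind == "text" for kind, _ in children)
    return requested and (has_text or element_count > 1)


def is_single_tree(children):
    """Hypothetical check: one element, only whitespace text around it."""
    element_count = sum(1 for kind, _ in children if kind == "element")
    stray_text = any(
        kind == "text" and value.strip() for kind, value in children
    )
    return element_count == 1 and not stray_text


two_roots = [("element", "a"), ("element", "b")]
padded = [("text", "\n"), ("element", "a"), ("text", "\n")]

# Multiple trees only trip SEPM0004 when standalone is not 'omit'.
print(triggers_sepm0004(two_roots), triggers_sepm0004(two_roots, "yes"))
# Whitespace-only text trips SEPM0004 but passes the single-tree check.
print(triggers_sepm0004(padded, "no"), is_single_tree(padded))
```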

xatapult commented 4 years ago

??? How would you serialize an XDM document that is not well-formed? Isn't that un-serializable by definition?

xml-project commented 4 years ago

@xatapult fair enough: XSLT and XQuery serialization says:

The XML output method serializes the normalized sequence as an XML entity that MUST satisfy the rules for either a well-formed XML document entity or a well-formed XML external general parsed entity, or both. A serialization error [err:SERE0003] results if the serializer is unable to satisfy those rules, except for content modified by the character expansion phase of serialization, as described in 4 Phases of Serialization. The effects of the character expansion phase could result in the serialized output being not well-formed, but will not result in a serialization error.

However, p:store, for example, says nothing about serialization errors, so I think the generic XD0030 has to be raised. To my understanding, it would make more sense to add a specific XProc error code for this.

gimsieke commented 4 years ago

Please read the next paragraph:

If the document node of the normalized sequence has a single element node child and no text node children, then the serialized output is a well-formed XML document entity, and the serialized output MUST conform to the appropriate version of the XML Namespaces Recommendation [XML Names] or [XML Names 1.1]. If the normalized sequence does not take this form, then the serialized output is a well-formed XML external general parsed entity, which, when referenced within a trivial XML document wrapper like this:

<?xml version="version"?>
<!DOCTYPE doc [
<!ENTITY e SYSTEM "entity-URI">
]>
<doc>&e;</doc>

Otherwise XQuery wouldn’t be able to serialize multi-top-level-element stuff as XML, would it?
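A small Python sketch of the distinction in the quoted paragraph. It assumes that inlining the content into a doc wrapper is an acceptable stand-in for the &e; entity reference (it ignores text declarations, which an external parsed entity may start with), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_document(text):
    """Return True if text is a well-formed XML document entity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parses_as_entity_content(text):
    """Return True if text is well-formed once placed inside a wrapper.

    Simplification (assumption): the content is inlined into <doc>...</doc>
    instead of being referenced through an external entity declaration.
    """
    return parses_as_document("<doc>" + text + "</doc>")


multi_root = "<a/>text<b/>"   # two elements plus top-level text
broken = "<a><b></a>"         # mismatched tags, not well-formed anywhere

print(parses_as_document(multi_root), parses_as_entity_content(multi_root))
print(parses_as_document(broken), parses_as_entity_content(broken))
```

Multi-root content fails as a document but passes inside the wrapper, while genuinely malformed markup fails both ways.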

Also you are quoting the superseded 3.0 serialization spec, but the content should be the same.

xml-project commented 4 years ago

Also you are quoting the superseded 3.0 serialization spec, but the content should be the same.

Oops, sorry about that.

But that brings us back to the original question, doesn't it?

gimsieke commented 4 years ago

We don't need to introduce a new error or option because people can specify true or false for the standalone entry of the serialization map in order to trigger an error for multiple-tree documents.

xml-project commented 4 years ago

OK, but that error would be XD0030, right? There is no guarantee that the underlying serialization error is exposed to the pipeline. Do you think that would be sufficient?

gimsieke commented 4 years ago

I think it's a quality of implementation issue. What are the error codes in the serialization spec for if host implementations don't raise them?

xml-project commented 4 years ago

I do not think so. An XProc implementation has to raise an XProc error as stated in the specs. For p:store etc., a serialization error is not mentioned, so a conformant XProc processor has to raise XD0030. I agree that the error message itself is a quality-of-implementation matter, but that is unrelated to the question of which error codes can be handled in a p:catch. We haven't defined an interoperable way to access underlying errors in XProc, so a pipeline catching the error in XML Calabash will likely not work in MorganaXProc and vice versa. A second thought: after reading the section on "serialization" again, I realized that an XProc processor is not required to support 'standalone'.

gimsieke commented 4 years ago

Where do we state that only XProc errors may be raised? What is an XProc error, anyway? We say that implementations should use the c:error vocabulary for the error documents, but we don’t say that they may raise only error codes that are defined in an XProc spec.

In the Value Templates section, we say for example: “The error is signaled using the appropriate XPath error code.” And I think we are referring to XPath error codes here.

The p:xslt Error Example doesn’t contain an error code at all. I’d expect for p:xslt that any xsl:message[@terminate='true']/@error-code would bubble up, as would error codes defined in the XSLT spec.

We can add another sentence that the most specific error code should be raised, even and in particular if it is defined in specs that we build upon (and that are in the normative references).

(Is it possible to throw multiple error codes? Then if an implementation encounters an exception with code SEPM0004 from a serializing class, it may add XD0030 and let both codes bubble up.)
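The idea of letting both codes bubble up could look roughly like the following Python sketch. Everything here is hypothetical (the class names, the store helper, the matching rule); it does not describe behavior defined by any XProc spec or implementation. It only shows how a handler keyed on either the specific or the generic code could fire for the same failure.

```python
class SerializerError(Exception):
    """Hypothetical error raised by an underlying serializer."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class StepError(Exception):
    """Hypothetical step error carrying several codes, most specific first."""

    def __init__(self, codes, message):
        super().__init__(message)
        self.codes = tuple(codes)


def store(serialize):
    """Run a serializer callable, adding the generic code on failure."""
    try:
        return serialize()
    except SerializerError as err:
        raise StepError([err.code, "XD0030"], str(err)) from err


def catch_matches(error, wanted_codes):
    """Mimic a catch handler that fires if any listed code matches."""
    return any(code in wanted_codes for code in error.codes)


def failing_serializer():
    raise SerializerError("SEPM0004", "multiple top-level elements")


caught = None
try:
    store(failing_serializer)
except StepError as err:
    caught = err

# A handler for the generic code and one for the specific code both match.
print(caught.codes, catch_matches(caught, {"XD0030"}))
```

A pipeline written against the generic code would then keep working on a processor that also exposes the more specific one.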

About standalone: If the processor doesn’t support it, then I’d say that users of these processors cannot generate such an error. The thought is to use existing error codes of the underlying specs wherever possible instead of replicating or subsuming these errors in XProc.

Before we continue with the discussion I’d like to get more feedback from @ndw and @xatapult particularly on the question of bubbling up “foreign” error codes.

xml-project commented 4 years ago

The correct quote would be:

It is a static error if the string contained between matching curly brackets in a value template, when interpreted as an XPath expression, contains errors. The error is signaled using the appropriate XPath error code.

I am talking about dynamic errors obviously. And I like your idea that a processor is free to raise any error it likes. Makes life for an implementer a lot easier.

xml-project commented 4 years ago

Close?

xml-project commented 4 years ago

As there is no further discussion here, and no one has objected to closing this issue since early March: closed.