xproc / 3.0-steps

Repository for change requests to the standard step library and for official extension steps

Reading XML files from an archive with weird filenames #364

Closed. xatapult closed this issue 4 years ago.

xatapult commented 4 years ago

I ran into a use-case that is quite common, but for which we have no elegant solution in XProc 3.0. @gimsieke, I think this is rather important for transpect too!

MS Office files (like .docx and .xlsx) are zip files with all kinds of XML inside. Some entries have filenames like .rels. Since this is not a standard XML file extension, p:unarchive does not recognize them as XML and loads them as binary. But you need the XML to interpret the Office format correctly.

The root cause is of course that we load the contents of an archive in one go using p:unarchive. There is no mechanism to specify the content types of the individual files; you have to leave that to the processor, which does its best but sometimes guesses wrong.

So how can we solve this? I see several solutions:

  1. We do nothing. To convert the binary to XML, you have to write it to disk (to a temp file) and then read it back in. This works but is expensive and rather inelegant.
  2. We add some feature to (probably) p:cast-content-type that forces it to view a binary file as text, so you can convert it to XML after it's loaded.
  3. We add a mechanism to p:unarchive that helps the processor guess the file types. Maybe just a content-type attribute that forces all unarchived files to this type; you then use p:unarchive to load the files one by one, or in controlled sets using include-filter and exclude-filter (a sketch follows below).

Actually, my preference would be to implement both 2 and 3. Both are useful in their own way, and other use-cases will probably benefit as well.
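
For concreteness, here is a minimal sketch of option 3. The content-type option on p:unarchive is hypothetical (it does not exist in the current step library); the rest is standard XProc 3.0:

<!-- Hypothetical content-type option, forcing all matched
     entries to be loaded as XML: -->
<p:unarchive include-filter="\.rels$" content-type="application/xml">
  <p:with-input href="some-file.docx"/>
</p:unarchive>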

gimsieke commented 4 years ago

My understanding was that processors can use any method that they deem appropriate in order to guess the correct media type. So I think they will at least support common XML archive members with unorthodox (or no) extensions.

In other functional languages one might solve this by submitting a function item to the step that accepts the file name and returns the content type.

Function items aren’t well supported in XProc 3.0, but maps are. So what about a map(xs:string, xs:string) with regular expressions as keys and content types as values? The semantics would be: If this map option is given and if a file name matches the regex supplied as a key, then the content type supplied as a value is returned, and the archive member is treated as if it were read with p:load with the respective content-type option.

If a given file name matches multiple keys and these keys are associated with different content types, it is undefined which content type is selected.

Otherwise (if no key matches), the processor-dependent mapping rules apply that would apply in the absence of this map option.
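
A minimal sketch of what such a map option could look like (the option name content-type-map is just a placeholder; passing the map in shortcut attribute form assumes the option is declared with a map type, as the existing parameters option is):

<p:unarchive
  content-type-map="map{
    '\.rels$' : 'application/xml',
    '\.bin$'  : 'application/octet-stream'
  }">
  <p:with-input href="some-file.docx"/>
</p:unarchive>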

xatapult commented 4 years ago

I like Gerrit's solution. It doesn't sound too hard to implement. Opinions?

xatapult commented 4 years ago

Should an extension to p:cast-content-type that treats binary as text when needed be considered as well?

Once I ran into it, the fact that we can't get from binary to XML looked like an oversight.
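
For illustration, the missing piece is a cast like the following (hypothetical: the current p:cast-content-type does not allow casting an "other" document to XML):

<!-- Cast a binary (application/octet-stream) document that actually
     contains XML, e.g. a .rels file from p:unarchive, to XML: -->
<p:cast-content-type content-type="application/xml"/>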

gimsieke commented 4 years ago

I agree that it would be nice to use p:cast-content-type to attempt to parse an “other” document as XML, HTML, text, or JSON. Haven’t we discussed this already?

ndw commented 4 years ago

I don't recall if we discussed the p:cast-content-type issue or not, but yes, I think attempts to cast binary to XML, HTML, text, or JSON should succeed if the actual octets can be parsed into that format, given their actual content type (which may include an encoding, I fear).

The only hesitation I have about the map idea on p:unarchive is that I think the problem it's solving applies more broadly than just p:unarchive. I've had to extend the JVM content type list several times in the course of testing and demos. I would rather not have a solution that works for p:unarchive but doesn't work for p:load (for example). I wonder if the JVM content type solution works in p:unarchive. Seems like it should, but I haven't tested it.

See: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8039362

xatapult commented 4 years ago

@ndw The proposed solution (with the map) is IMHO necessary only for p:unarchive, because that step loads a whole bunch of files in one go.

The situation for p:load is rather different: You load a single file and the step already has a content-type option to override the automatic type guessing.
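
For comparison, the single-file case is already covered, since the content-type option on p:load overrides the automatic guess (the href here is just an example path):

<p:load href="exploded-docx/word/_rels/document.xml.rels"
  content-type="application/xml"/>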

xatapult commented 4 years ago

I created a new issue #365 for further discussion of the extensions to p:cast-content-type.

This issue is now for discussing an optional extension to p:unarchive only.

ndw commented 4 years ago

I agree that the details of p:unarchive vs. p:load are different, but they're also similar. If I get URIs from some other step, it may not be straightforward to specify the override content type. In fact, if what's being loaded is some external, exploded archive, then the use case might even be exactly the same.

If we can avoid having to have two (or more) different mechanisms for things that are closely related or the same, that would be better. IMHO.

xml-project commented 4 years ago

FYI: In MorganaXProc-IIIse 0.9.1.6 I implemented the following solution to the original problem with p:unarchive: The parameters option now accepts an entry 'mox:content-types', which has to be associated with an instance of array(array(*)). Therefore you can call p:unarchive like so:

<p:unarchive xmlns:mox="http://www.xml-project.com/morgana"
  include-filter="\S+\.rels"
  parameters="map{'mox:content-types' :
      [
        ['\S+\.rels', 'application/xml']
      ]
    }">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

For each entry in the outer array, an inner array with cardinality 2 is expected: the first member is taken as a regular expression, the second as a content type. The array is inspected in order: if the path of the unarchived document matches a regex, the associated content type is used to interpret the document, and no further regex is considered. If no regex matches, the processor's standard association applies.

The implementation is clearly inspired by @gimsieke's suggestion in his comment, but I went for arrays instead of a map. The basic advantage is that arrays are ordered, so they can be inspected in turn, while with a map I would have to inspect ALL entries and raise an error if more than one matched. Additionally, it allows users to specify lists of regexes with increasing levels of granularity.

gimsieke commented 4 years ago

Another question is whether the regexes are implicitly anchored, that is, whether you need to specify a regex that matches the file name front-to-back. This seems to be your approach (\S+\.rels), while in mine you can specify \.rels$. I think any specification or implementation should document whether matching is auto-anchored or not.

gimsieke commented 4 years ago

It probably should be interpreted in the same way as the include-filter and exclude-filter regexes, for which we also don't say whether they are auto-anchored or not, at least when used on p:unarchive.

For p:directory-list, our examples imply that they are not auto-anchored, that is, you can also specify a regex that matches parts of the relative path.

I think we should unify all these filters on p:unarchive and p:directory-list:

  1. The regexes see the complete relative paths in the archive or below the directory given in $path, not just the base names of archive/directory members.
  2. The regexes match parts of the name; they are not tacitly anchored. If a pipeline author wants to anchor them, they need to use ^…$ (illustrated below).
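
To make rules 1 and 2 concrete with XPath matches() semantics, against the relative path word/_rels/document.xml.rels:

(: Unanchored (rule 2), matching against the complete relative path (rule 1): :)
matches('word/_rels/document.xml.rels', '\.rels$')      (: true :)
matches('word/_rels/document.xml.rels', '_rels/')       (: true :)

(: Explicit anchoring, if an author wants it: :)
matches('word/_rels/document.xml.rels', '^\S+\.rels$')  (: true :)

With auto-anchoring, \.rels$ would instead behave like ^\.rels$ and fail against this path.
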
xml-project commented 4 years ago

@gimsieke Please mind that all entries in the parameters option of p:unarchive are implementation-defined. So what I stated above is information about my implementation, not a proposal.

xatapult commented 4 years ago

No, but we could turn it into a proposal and add it (under a different, non-Morgana-specific name) to the spec, right?

ndw commented 4 years ago

Would this be more useful globally or on a per-p:unarchive basis?

xatapult commented 4 years ago

There is (I think) a low-impact solution to this: add a content-type option to p:unarchive. If you specify it, all unarchived files are treated as having this content type. You then unarchive the files one by one, or in batches that share the same content type. What you need up-front is some idea of the content types of the archive's entries. See #368
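
A sketch of that batch approach, assuming the hypothetical content-type option and that you already know (e.g. via p:archive-manifest, see #368) which entries carry XML:

<!-- First pass: force the XML-bearing entries to XML: -->
<p:unarchive include-filter="\.(xml|rels)$" content-type="application/xml">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

<!-- Second pass: everything else, with the processor's default guessing: -->
<p:unarchive exclude-filter="\.(xml|rels)$">
  <p:with-input href="some-file.docx"/>
</p:unarchive>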

xatapult commented 4 years ago

We take Achim's idea and add an option to both p:unarchive and p:archive-manifest to specify content type overrides.
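
A sketch of how that could look, reusing the array-of-arrays shape from Achim's implementation (the option name override-content-types is illustrative here):

<p:unarchive override-content-types="[['\.rels$', 'application/xml']]">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

<p:archive-manifest override-content-types="[['\.rels$', 'application/xml']]">
  <p:with-input href="some-file.docx"/>
</p:archive-manifest>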

xatapult commented 4 years ago

@gimsieke About not anchoring the filename matching regexps:

Can we state that the regexp matching works like the XPath matches() function (with 2 parameters)?

matches(path-to-match, regexp)

gimsieke commented 4 years ago

I think so

xatapult commented 4 years ago

Do we need a check on the value of a content type (the second members of the inner arrays), or is it just a string, so that you can specify [ ['.rels$', ' what a nonsense &#10; '] ]?

If we need checking, what would be the definition of a content-type?

There is already an error for specifying an invalid content type (XC0130), coming from p:http-request. Maybe we can re-use this?

If so, maybe there are more places where we need to add something about this (p:cast-content-type?)
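
For reference, a minimal check along the lines of the media-type grammar (type '/' subtype, with RFC 6838 restricted-name characters) could look like the XPath below; whether XC0130 is defined exactly this way is up to the core spec, and $content-type is just an illustrative variable:

matches($content-type,
  '^[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]*/[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]*$')

(: true for 'application/xml', false for ' what a nonsense &#10; ' :)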

xatapult commented 4 years ago

This will be covered by an error listed in the core spec.

xatapult commented 4 years ago

Done with #379, but the debate is not finished; maybe some additional changes are necessary.

xatapult commented 4 years ago

Debate finished; follow-up in #385. Closing this issue for now.