Closed xatapult closed 4 years ago
My understanding was that processors can use any method that they deem appropriate in order to guess the correct media type. So I think they will at least support common XML archive members with unorthodox (or no) extensions.
In other functional languages one might solve this by submitting a function item to the step that accepts the file name and returns the content type. Function items aren't well supported in XProc 3.0, but maps are. So what about a `map(xs:string, xs:string)` with regular expressions as keys and content types as values? The semantics would be: if this map option is given and a file name matches the regex supplied as a key, then the content type supplied as the value is returned, and the archive member is treated as if it were read with `p:load` with the respective `content-type` option.
If a given file name matches multiple keys and these keys are associated with different content types, it is undefined which content type is selected.
Otherwise (if no key matches), the processor-dependent mapping rules apply that would apply in the absence of this map option.
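As a sketch of how this might look in a pipeline (the option name `content-types` is hypothetical; no such option exists in the spec as written here):

```xml
<!-- Hypothetical content-types option: regexes as keys, content types as values -->
<p:unarchive content-types="map{
                '\.rels$' : 'application/xml',
                '\.bin$'  : 'application/octet-stream'
              }">
  <p:with-input href="some-file.docx"/>
</p:unarchive>
```

Any member whose name matches `\.rels$` would then be treated as if it had been loaded with `p:load` and `content-type="application/xml"`.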
I like Gerrit's solution. Sounds not too hard to implement. Opinions?
Is an extension to `p:cast-content-type` that treats binary as text, if needed, something to be considered as well? When I ran into it, it looked like an oversight that we can't get from binary to XML.
I agree that it would be nice to use `p:cast-content-type` to attempt to parse an "other" document as XML, HTML, text, or JSON. Haven't we discussed this already?
I don't recall if we discussed the `p:cast-content-type` issue or not, but yes, I think attempts to cast binary to XML, HTML, text, or JSON should succeed if the actual octets can be parsed into that format given their actual content type (which may include an encoding, I fear).
The only hesitation I have about the map idea on `p:unarchive` is that I think the problem it's solving applies more broadly than just `p:unarchive`. I've had to extend the JVM content type list several times in the course of testing and demos. I would rather not have a solution that works for `p:unarchive` but doesn't work for `p:load` (for example). I wonder if the JVM content type solution works in `p:unarchive`. Seems like it should, but I haven't tested it.

See: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8039362
@ndw The solution for `p:unarchive` as proposed (with the map) is IMHO necessary only for `p:unarchive`, because `p:unarchive` loads a whole bunch of files in one go. The situation for `p:load` is rather different: you load a single file, and the step already has a `content-type` option to override the automatic type guessing.
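For illustration (the member path here is made up), forcing an unorthodox extension to be parsed as XML with `p:load` already works like this:

```xml
<!-- Override the processor's type guessing for a single file -->
<p:load href="word/_rels/document.xml.rels"
        content-type="application/xml"/>
```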
I created a new issue #365 for further discussion of the extensions to `p:cast-content-type`. This issue is now for discussing an optional extension to `p:unarchive` only.
I agree that the details of `p:unarchive` vs. `p:load` are different, but they're also similar. If I get URIs from some other step, it may not be straightforward to specify the override content type. In fact, if what's being loaded is some external, exploded archive, then the use case might even be exactly the same.

If we can avoid having two (or more) different mechanisms for things that are closely related or the same, that would be better, IMHO.
FYI: In MorganaXProc-IIIse 0.9.1.6 I implemented the following solution to the original problem with `p:unarchive`: the option `parameters` now accepts an entry `mox:content-types`, which has to be associated with an instance of `array(array(*))`. Therefore you can call `p:unarchive` like so:

```xml
<p:unarchive include-filter="\S+\.rels"
             parameters="map{'mox:content-types' :
               [
                 ['\S+\.rels', 'application/xml']
               ]
             }"
             xmlns:mox="http://www.xml-project.com/morgana">
  <p:with-input href="some-file.docx"/>
</p:unarchive>
```

For each entry in the outer array, an inner array with cardinality 2 is expected: the first member is taken as a regular expression, the second as a content type. The array is inspected in order: if the path of the unarchived document matches the regex, the given content type is used to interpret the unarchived document, and no further regex is considered. If no regex matches, the processor's standard association applies. The implementation is clearly inspired by @gimsieke's suggestion in his comment, but I went for arrays instead of a map. The basic advantage is that arrays have order, so they can be inspected in turn, while with a map I would have to inspect ALL entries and raise an error if more than one matched. Additionally, it allows users to specify lists of regexes with increasing levels of granularity.
Another question is whether the regexes are implicitly anchored, that is, whether you need to specify a regex that matches the file name front-to-back. This seems to be your approach (`\S+\.rels`), while in mine you can specify `\.rels$`. I think any specification or implementation should document whether matching is auto-anchored or not.
It probably should be interpreted in the same way as the `include-filter` and `exclude-filter` regexes, of which we also don't say whether they are auto-anchored or not, at least when used on `p:unarchive`.
For `p:directory-list`, our examples imply that they are not auto-anchored, that is, you can also specify a regex that matches parts of the relative path.

I think we should unify all these filters on `p:unarchive` and `p:directory-list`:

- They should match against the relative path `$path`, not just the base names of archive/directory members.
- They should not be implicitly anchored; anyone who wants front-to-back matching can write `^…$`.
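To illustrate the difference under unanchored, `matches()`-style semantics (the member path mentioned in the comment is a typical .docx entry):

```xml
<!-- Unanchored matching: '\.rels$' matches word/_rels/document.xml.rels,
     because the regex only has to match a substring of the path.
     Front-to-back (anchored) matching would require '^\S+\.rels$'. -->
<p:unarchive include-filter="\.rels$">
  <p:with-input href="some-file.docx"/>
</p:unarchive>
```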
@gimsieke Please mind that all entries in option `parameters` of `p:unarchive` are implementation-defined. So what I stated above is information about my solution, not a proposal for anything.
No, but we could turn it into a proposal and add it (using a different, non-Morgana-specific name) to the spec, right?
Would this be more useful globally or on a per-`p:unarchive` basis?
There is a (I think) low-impact solution to this: add a `content-type` option to `p:unarchive`. If you specify this, it will treat all unarchived files as this content type. Then unarchive the files one by one, or in batches with the same content type. What you need up-front for this is some idea of the content types of the archive's entries. See #368.
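With such a `content-type` option (a sketch of the #368 proposal; the option did not exist at the time of this comment), the .docx case could be handled in two passes, e.g.:

```xml
<!-- Pass 1: the .rels members, forced to XML -->
<p:unarchive include-filter="\.rels$" content-type="application/xml">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

<!-- Pass 2: everything else, with the processor's default type guessing -->
<p:unarchive exclude-filter="\.rels$">
  <p:with-input href="some-file.docx"/>
</p:unarchive>
```

(In a real pipeline the two steps would of course need their output sequences connected or named separately; this is only a fragment.)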
We take Achim's idea and add an option to both `p:unarchive` and `p:archive-manifest` to specify content type overrides.
@gimsieke About not anchoring the filename-matching regexps: can we state that the regexp matching works like the XPath `matches()` function (with two parameters)? `matches(path-to-match, regexp)`
I think so
Do we need a check on the value of a content type (the second member of the inner arrays), or is this just a string, so that you can specify `[ ['.rels$', ' what a nonsense '] ]`? If we need checking, what would be the definition of a content type?

There is already an error for specifying an invalid content type (XC0130), coming from `p:http-request`. Maybe we can re-use this? If so, maybe there are more places where we need to add something about this (`p:cast-content-type`?).
This will be covered by an error listed in the core spec.
Done with #379, but the debate is not finished; maybe some additional changes are necessary.
Debate finished and followed up in #385. Closing this issue for now.
I ran into a use-case for which we have no elegant solution in XProc 3.0 but is quite common. @gimsieke , I think this is rather important for transpect too!
MS Office files (like `.docx` and `.xlsx`) are zip files with all kinds of XML inside. Some have filenames like `.rels`. Since this is not a standard XML file extension, `p:unarchive` does not recognize it as XML and loads it as binary. But you need the XML to interpret the office format correctly.

The base cause is of course that we load the contents of an archive in one go using `p:unarchive`. There is no mechanism to specify the content types of the individual files; you have to leave that to the processor, which does its best but is sometimes wrong.

So how can we solve this? I see several solutions:

1. …
2. An extension to `p:cast-content-type` that forces it to view a binary file as text, so you can convert it to XML after it's loaded.
3. Adding something to `p:unarchive` to help the processor guess the file types. Maybe just a `content-type` attribute that forces all unarchived files to this type, and then use `p:unarchive` to load the files one-by-one or in controlled sets using `include-filter` and `exclude-filter`.

Actually, my preference would be to implement both 2 and 3. Both are useful in their own way, and probably other use cases benefit also.
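Solution 2 might look roughly like this in use (hypothetical, since `p:cast-content-type` as specified does not allow casting from a binary content type; that is exactly the gap being discussed):

```xml
<!-- Hypothetical: rescue a member that p:unarchive delivered as
     application/octet-stream by casting binary -> text -> XML -->
<p:cast-content-type content-type="text/plain"/>
<p:cast-content-type content-type="application/xml"/>
```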