Specification for override content type

xatapult commented 4 years ago

Issue #364 Added error for invalid regular expressions. Will apply this error to other steps where applicable as of #378

ndw commented 4 years ago

What is the "relative filename of the archive's entry"? Don't you just mean the filename of the archive's entry? I assume that if the archive filename is /path/to/my/thing.xml, I can write a regex that matches on /path/to/my/ for example. Or did we decide otherwise?

xatapult commented 4 years ago

AFAIK we decided to work on the relative filename in the archive, so it will never start with a /. This is also how its done in p:unarchive, for the include and exclude filters:

Names of entries in archives are always relative names. For instance, the name of a file called xyz.xml in a specs subdirectory in an archive is called in full specs/xyz.xml (and not /specs/xyz.xml).

Whatever it is, it is worth making this clear (I tried by providing an example). It's always a nuisance these kinds of details.

ndw commented 4 years ago

I'm fine with saying that we create archives with "relative" names, but I think in matching on unarchive, we should say we match against the actual filename in the directory entry. We shouldn't be tampering with it before applying the regex.

xatapult commented 4 years ago

I don't follow. We're not "tampering". We're just defining exactly what string the regexp will match against. And whether that is with or without a leading /, I don't care, it's just a choice. As long as its defined unambiguously it's OK as far as I'm concerned.

I don't see any reason to change the decision we made on doing it against filenames that do not start with /, again, just a choice. But if you do, let's discuss this the next call. If we decide to change this we should IMHO do it everywhere, also in the include and exclude filters of p:unarchive.

@gimsieke , @xml-project Any opinions?

gimsieke commented 4 years ago

I didn’t completely understand what kind of file name tampering Norm was afraid of, and I don’t think that Erik suggested any tampering. He was pointing out that all filters match against relative paths, without a leading slash. To my surprise, we seem to have gotten this consistent across the archive/unarchive and directory-list steps.

ndw commented 4 years ago

All I'm saying is that we don't need to say anything about "relative paths" in this step. This step should match against the path specified in the archive, exactly as it appears in the archive. If someone used some tool that sticks a leading "/" on the archive filename, that should be part of what's matched against the regex. We shouldn't be removing leading slashes if they occur before we do the comparison.

xatapult commented 4 years ago

All I'm saying is that we don't need to say anything about "relative paths" in this step

Yes, we should!!! If you take an archive as the root of a file system, you need to define exactly how an entry in the archive is referenced.

If someone used some tool that sticks a leading "/" on the archive filename

I really don't follow. A file in an archive has a path leading up to it. How can you stick a slash in front?

Aren't we confusing the full path (the path of the archive itself + the path of the entry in the archive) with what we're dealing with here? What I mean is only the part in the archive.

gimsieke commented 4 years ago

Even if someone managed to prepend a slash to each entry, this slash will be disregarded by p:unarchive and p:archive-manifest as per the two list items in front of this message. So if we say that matching will be performed against the string in c:entry/@name (if someone requested an archive manifest), it will have no leading slash.

ndw commented 4 years ago

I'm not sure what relative-to has to do with it. I'm also surprised that we seem to be talking past each other. Here's a zip file I have on my disk:

$ unzip -v example.zip
Archive:  example.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       4  Defl:N        6 -50% 04-01-2020 16:05 7e3265a8  /tmp/x/readme
--------          -------  ---                            -------
       4                6 -50%                            1 file

I would expect this override-content-type to match:

[ [ '^/.*$', 'text/plain' ]]

Coming back to relative-to, suppose it's file:///home/ndw/. Then, off the top of my head, I expect that to work like this:

import java.net.URI
relative_to = URI.create("file:///home/ndw/")
relative_to.resolve("/tmp/x/readme")

And that returns file:/tmp/x/readme.

Granted, leading slashes are uncommon, a bad idea, and almost always wrong, but how is my analysis incorrect?

gimsieke commented 4 years ago

relative-to doesn’t matter here as much as: “and with the relative path of this document as it was in the archive as first parameter”. In the resulting manifest your /tmp/x/readme will always appear as <c:entry name="tmp/x/readme"/>. This is because once you have the full URI, you need to strip away the leading relative-to or the leading slash of the path portion as per the two items that precede this other message.

I just noticed that what I tacitly assumed here is that before stripping away relative-to, a slash will be appended if there is no trailing slash. This is the behavior that p:urify() requires for its 2nd ($basedir) argument.

We need to clarify this:

If you supply relative-to="file:///home/ndw", the full URI will be file:///home/ndw/tmp/x/readme.
If you supply relative-to="file:///home/ndw/", the full URI will also be file:///home/ndw/tmp/x/readme.

So c:entry/@name of the resulting archive manifest should be tmp/x/readme in both cases. Therefore we need to say that the relative-to, potentially with a slash appended, will be subtracted from the full URI.

Alternatively, we can say that c:entry/@name for 1. will be /tmp/x/readme and for 2. it will be tmp/x/readme. But I think this alternative is not consistent with stripping away the leading slash for c:entry/@name in the following case:

If there is no relative-to option or if its value is not the initial substring of the base URI of the document, the name is the path portion of the URI (per [RFC 3986]). If the path portion begins with an initial slash, that slash is removed.

xatapult commented 4 years ago

Maybe it's just the term "relative" that's confusing?

We can change that in: "the path leading up to the file in the archive, without a leading /"

xatapult commented 4 years ago

I'm going to merge this PR (because its blocking another PR I want to make) but leave the issue open. I suppose the debate is not finished yet.

ndw commented 4 years ago

@gimsieke writes:

If you supply relative-to="file:///home/ndw", the full URI will be file:///home/ndw/tmp/x/readme.

No, it won't. That is not how the RFCs define relative URI resolution. We should not be introducing magic that changes the established rules!

Also, if the filename in the archive is "/tmp/x/readme", the resolved value will be file:///tmp/x/readme because that's also how URI resolution works.

gimsieke commented 4 years ago

I understood that URIs of extracted archive members should always (also for members with absolute paths) be archive_uri/relative_path. At least I remembered that this was our decision “because that’s how Java etc. form URIs for archive members” (Erik said that).

This means that URI resolution needs to happen

after a leading slash is stripped from the entry name and
after a slash is appended to the archive URI (or to the relative-to string) if there isn’t already a trailing slash

These rules make sure that an extracted member /tmp/x/readme of an archive file:/home/ndw/example.zip will get the URI file:///home/ndw/example.zip/tmp/x/readme (instead of file:///tmp/x/readme), which is indiscernible from the URI that a member with the name tmp/x/readme will have after extraction.

I’m not saying that I want it to work this way, but I think this was the (maybe not fully informed wrt edge cases) consensus.

If this rule holds, it also means that unarchiving followed by archiving might not be idempotent: A member that had the name /tmp/x/readme will have the name tmp/x/readme after re-creating the archive (because “when ZIP archives are processed, every name in the manifest must be a relative path without a leading slash”, a statement that might also solicit further discussion).

If we go with “an extracted archive member /tmp/x/readme will have the URI file:///tmp/x/readme instead of file:///home/ndw/example.zip/tmp/x/readme, at least in the absence of relative-to”, then, in order not to render relative-to useless for absolute paths, we still need to stipulate that when relative-to is given, URI resolution still needs to happen like this:

after a leading slash is stripped from the entry name and
slash is appended the relative-to string, unless already present

Otherwise, as I said, relative-to will be useless if entry names are absolute paths.

xproc / 3.0-steps

Specification for override content type #379