Closed xatapult closed 4 years ago
What is the "relative filename of the archive's entry"?
Don't you just mean the filename of the archive's entry? I assume that if the archive filename is /path/to/my/thing.xml
, I can write a regex that matches on /path/to/my/
for example. Or did we decide otherwise?
AFAIK we decided to work on the relative filename in the archive, so it will never start with a /. This is also how its done in p:unarchive
, for the include and exclude filters:
Names of entries in archives are always relative names. For instance, the name of a file called xyz.xml in a specs subdirectory in an archive is called in full specs/xyz.xml (and not /specs/xyz.xml).
Whatever it is, it is worth making this clear (I tried by providing an example). It's always a nuisance these kinds of details.
I'm fine with saying that we create archives with "relative" names, but I think in matching on unarchive, we should say we match against the actual filename in the directory entry. We shouldn't be tampering with it before applying the regex.
I don't follow. We're not "tampering". We're just defining exactly what string the regexp will match against. And whether that is with or without a leading /
, I don't care, it's just a choice. As long as its defined unambiguously it's OK as far as I'm concerned.
I don't see any reason to change the decision we made on doing it against filenames that do not start with /
, again, just a choice. But if you do, let's discuss this the next call. If we decide to change this we should IMHO do it everywhere, also in the include and exclude filters of p:unarchive
.
@gimsieke , @xml-project Any opinions?
I didn’t completely understand what kind of file name tampering Norm was afraid of, and I don’t think that Erik suggested any tampering. He was pointing out that all filters match against relative paths, without a leading slash. To my surprise, we seem to have gotten this consistent across the archive/unarchive and directory-list steps.
All I'm saying is that we don't need to say anything about "relative paths" in this step. This step should match against the path specified in the archive, exactly as it appears in the archive. If someone used some tool that sticks a leading "/" on the archive filename, that should be part of what's matched against the regex. We shouldn't be removing leading slashes if they occur before we do the comparison.
All I'm saying is that we don't need to say anything about "relative paths" in this step
Yes, we should!!! If you take an archive as the root of a file system, you need to define exactly how an entry in the archive is referenced.
If someone used some tool that sticks a leading "/" on the archive filename
I really don't follow. A file in an archive has a path leading up to it. How can you stick a slash in front?
Aren't we confusing the full path (the path of the archive itself + the path of the entry in the archive) with what we're dealing with here? What I mean is only the part in the archive.
Even if someone managed to prepend a slash to each entry, this slash will be disregarded by p:unarchive
and p:archive-manifest
as per the two list items in front of this message. So if we say that matching will be performed against the string in c:entry/@name
(if someone requested an archive manifest), it will have no leading slash.
I'm not sure what relative-to
has to do with it. I'm also surprised that we seem to be talking past each other. Here's a zip file I have on my disk:
$ unzip -v example.zip
Archive: example.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
4 Defl:N 6 -50% 04-01-2020 16:05 7e3265a8 /tmp/x/readme
-------- ------- --- -------
4 6 -50% 1 file
I would expect this override-content-type
to match:
[ [ '^/.*$', 'text/plain' ]]
Coming back to relative-to
, suppose it's file:///home/ndw/
. Then, off the top of my head, I expect that to work like this:
import java.net.URI
relative_to = URI.create("file:///home/ndw/")
relative_to.resolve("/tmp/x/readme")
And that returns file:/tmp/x/readme
.
Granted, leading slashes are uncommon, a bad idea, and almost always wrong, but how is my analysis incorrect?
relative-to
doesn’t matter here as much as: “and with the relative path of this document as it was in the archive as first parameter”. In the resulting manifest your /tmp/x/readme
will always appear as <c:entry name="tmp/x/readme"/>
. This is because once you have the full URI, you need to strip away the leading relative-to
or the leading slash of the path portion as per the two items that precede this other message.
I just noticed that what I tacitly assumed here is that before stripping away relative-to
, a slash will be appended if there is no trailing slash. This is the behavior that p:urify()
requires for its 2nd ($basedir
) argument.
We need to clarify this:
relative-to="file:///home/ndw"
, the full URI will be file:///home/ndw/tmp/x/readme
.relative-to="file:///home/ndw/"
, the full URI will also be file:///home/ndw/tmp/x/readme
.So c:entry/@name
of the resulting archive manifest should be tmp/x/readme
in both cases. Therefore we need to say that the relative-to
, potentially with a slash appended, will be subtracted from the full URI.
Alternatively, we can say that c:entry/@name
for 1. will be /tmp/x/readme
and for 2. it will be tmp/x/readme
. But I think this alternative is not consistent with stripping away the leading slash for c:entry/@name
in the following case:
If there is no relative-to option or if its value is not the initial substring of the base URI of the document, the name is the path portion of the URI (per [RFC 3986]). If the path portion begins with an initial slash, that slash is removed.
Maybe it's just the term "relative" that's confusing?
We can change that in: "the path leading up to the file in the archive, without a leading /"
I'm going to merge this PR (because its blocking another PR I want to make) but leave the issue open. I suppose the debate is not finished yet.
@gimsieke writes:
If you supply
relative-to="file:///home/ndw"
, the full URI will befile:///home/ndw/tmp/x/readme
.
No, it won't. That is not how the RFCs define relative URI resolution. We should not be introducing magic that changes the established rules!
Also, if the filename in the archive is "/tmp/x/readme", the resolved value will be file:///tmp/x/readme
because that's also how URI resolution works.
I understood that URIs of extracted archive members should always (also for members with absolute paths) be archive_uri/relative_path
. At least I remembered that this was our decision “because that’s how Java etc. form URIs for archive members” (Erik said that).
This means that URI resolution needs to happen
relative-to
string) if there isn’t already a trailing slashThese rules make sure that an extracted member /tmp/x/readme
of an archive file:/home/ndw/example.zip
will get the URI file:///home/ndw/example.zip/tmp/x/readme
(instead of file:///tmp/x/readme
), which is indiscernible from the URI that a member with the name tmp/x/readme
will have after extraction.
I’m not saying that I want it to work this way, but I think this was the (maybe not fully informed wrt edge cases) consensus.
If this rule holds, it also means that unarchiving followed by archiving might not be idempotent: A member that had the name /tmp/x/readme
will have the name tmp/x/readme
after re-creating the archive (because “when ZIP archives are processed, every name in the manifest must be a relative path without a leading slash”, a statement that might also solicit further discussion).
If we go with “an extracted archive member /tmp/x/readme
will have the URI file:///tmp/x/readme
instead of file:///home/ndw/example.zip/tmp/x/readme
, at least in the absence of relative-to
”, then, in order not to render relative-to
useless for absolute paths, we still need to stipulate that when relative-to
is given, URI resolution still needs to happen like this:
relative-to
string, unless already presentOtherwise, as I said, relative-to
will be useless if entry names are absolute paths.
Issue #364 Added error for invalid regular expressions. Will apply this error to other steps where applicable as of #378