Open mmastrac opened 5 years ago
That is because the file is appareantly read in streaming mode (without the ability to seek) and the file length is not available before the start of the compressed data. In such a case it is not possible to determine the file length and thus not possible to further read the archive.
For these kinds of archives, only normal reading mode (with seeking) is supported.
I know this library is not actively maintained, but some of the users of ripgrep-all are hitting this since I don't want to or can't allow seeking, so I'm checking if this can be worked around without requiring seek and putting my thoughts here:
The error is thrown by
https://github.com/mvdnes/zip-rs/blob/e485cbf5768a3e44685c8e5da8126c541e04d01f/src/read.rs#L637
which is set by
https://github.com/mvdnes/zip-rs/blob/e485cbf5768a3e44685c8e5da8126c541e04d01f/src/read.rs#L587
According to this ZIP spec if that bit is set:
4.3.9 Data descriptor:
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
If the zip archive is written by something that couldn't seek when writing, instead of the compressed size and checksum being written before the compressed data, it is written after it, with references to where that metadata is located in the central directory, which is at the end of the file.
It seems like the data descriptor usually has a signature:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
Searching for that signature in the internet gives some interesting information - many programs see to assume it is always there.
So apart from seeking to the central directory, one way to solve this would be to start decompressing file data until the 0x08074b50 signature is found (or maybe until the compression stream ends, not sure if this is possible for Deflate). This is not perfect since that signature may be part of the stream itself (e.g. if another zip file is included in the zip file and not compressed (CompressionMethod::Stored)). Though the length and CRC field could be abused to check whether the header is valid and thus very probably not part of the data.
Thanks for the follow up @phiresky . I'm interested in adding this API, but it does come with caveats.
Though the zip format theoretically supports streaming, this parsing mode is only correct in "streaming contexts"; When an archive is being read from a medium that supports seeking (like the filesystem), the central header is allowed to contradict the local headers. That is to say I agree with mvdnes that there's no archive where this parsing method would actually be appropriate.
However, there's clearly still a meaningful way to parse files without a length in their local header. I'll try adding it to read_zipfile_from_stream
soon.
The bug that the new label refers to is the useless error message that we're throwing up at the moment :) At a minimum, I would like to report the lack of support for general purpose bit 3.
I'm pretty sure Deflate can also indicate the end of the stream to us, and at least in those cases, we should attempt to use the descriptor. Adding this feature is pretty low priority for now though :)
I'm using zip-rs indirectly through ripgrep-all and it's panicking while attempting to read certain jar files created with older versions of Java with "The file length is not available in the local header".
Attached an example of such a jar (renamed to .zip for github reasons)
retrace.zip