zip-rs / zip2

Zip implementation in Rust
Other
94 stars 32 forks source link

Zip files are parsed incorrectly when using data descriptors #178

Open mmastrac opened 5 years ago

mmastrac commented 5 years ago

I'm using zip-rs indirectly through ripgrep-all and it's panicking while attempting to read certain jar files created with older versions of Java with "The file length is not available in the local header".

Attached an example of such a jar (renamed to .zip for github reasons)

retrace.zip

mvdnes commented 5 years ago

That is because the file is appareantly read in streaming mode (without the ability to seek) and the file length is not available before the start of the compressed data. In such a case it is not possible to determine the file length and thus not possible to further read the archive.

For these kinds of archives, only normal reading mode (with seeking) is supported.

phiresky commented 4 years ago

I know this library is not actively maintained, but some of the users of ripgrep-all are hitting this since I don't want to or can't allow seeking, so I'm checking if this can be worked around without requiring seek and putting my thoughts here:

The error is thrown by

https://github.com/mvdnes/zip-rs/blob/e485cbf5768a3e44685c8e5da8126c541e04d01f/src/read.rs#L637

which is set by

https://github.com/mvdnes/zip-rs/blob/e485cbf5768a3e44685c8e5da8126c541e04d01f/src/read.rs#L587

According to this ZIP spec if that bit is set:

   4.3.9  Data descriptor:

        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      4.3.9.1 This descriptor MUST exist if bit 3 of the general
      purpose bit flag is set (see below).  It is byte aligned
      and immediately follows the last byte of compressed data.
      This descriptor SHOULD be used only when it was not possible to
      seek in the output .ZIP file, e.g., when the output .ZIP file
      was standard output or a non-seekable device.  For ZIP64(tm) format
      archives, the compressed and uncompressed sizes are 8 bytes each.

If the zip archive is written by something that couldn't seek when writing, instead of the compressed size and checksum being written before the compressed data, it is written after it, with references to where that metadata is located in the central directory, which is at the end of the file.

It seems like the data descriptor usually has a signature:

      4.3.9.3 Although not originally assigned a signature, the value 
      0x08074b50 has commonly been adopted as a signature value 
      for the data descriptor record.  Implementers SHOULD be 
      aware that ZIP files MAY be encountered with or without this 
      signature marking data descriptors and SHOULD account for
      either case when reading ZIP files to ensure compatibility.

Searching for that signature in the internet gives some interesting information - many programs see to assume it is always there.

So apart from seeking to the central directory, one way to solve this would be to start decompressing file data until the 0x08074b50 signature is found (or maybe until the compression stream ends, not sure if this is possible for Deflate). This is not perfect since that signature may be part of the stream itself (e.g. if another zip file is included in the zip file and not compressed (CompressionMethod::Stored)). Though the length and CRC field could be abused to check whether the header is valid and thus very probably not part of the data.

Plecra commented 3 years ago

Thanks for the follow up @phiresky . I'm interested in adding this API, but it does come with caveats.

Though the zip format theoretically supports streaming, this parsing mode is only correct in "streaming contexts"; When an archive is being read from a medium that supports seeking (like the filesystem), the central header is allowed to contradict the local headers. That is to say I agree with mvdnes that there's no archive where this parsing method would actually be appropriate.

However, there's clearly still a meaningful way to parse files without a length in their local header. I'll try adding it to read_zipfile_from_stream soon.

Plecra commented 1 year ago

The bug that the new label refers to is the useless error message that we're throwing up at the moment :) At a minimum, I would like to report the lack of support for general purpose bit 3.

I'm pretty sure Deflate can also indicate the end of the stream to us, and at least in those cases, we should attempt to use the descriptor. Adding this feature is pretty low priority for now though :)