zlib-ng / minizip-ng

Fork of the popular zip manipulation library found in the zlib distribution.
Other
1.25k stars 434 forks source link

Incorect compressed size in the local header with 4+ GB files and data_descriptor = 0 #543

Closed Eswcvlad closed 3 years ago

Eswcvlad commented 3 years ago

I've been testing zipping of big files (basically just full on random content files) and noticed an issue with compressed size values in the Zip64 extra field in the local header when streaming mode is disabled.

The issue is this:

  1. At the very start at entry open the uncompressed size is big, but at that point compressed size is zero. So in the Zip64 extra field in the local header there is only one field with the uncompressed size.
  2. At the end at entry close now both uncompressed and compressed sizes are > UINT32_MAX. The values in the base local header fields get updated to UINT32_MAX, but there is no expected value for the compressed size in the Zip64 extra field entry.

Also there is a similar twin issue, when uncompressed size is less than UINT32_MAX, but compressed size is bigger. The result is similar, but now without the Zip64 extra entry in the local header at all.

So with that we have a proper central directory header, as we have all of the info at that point, but it is impossible to get the compressed size from the local header. This breaks some of the unzipping tools.

I've made a sloppy attempt to fix it here: https://github.com/Eswcvlad/minizip/commit/3b01a375ea0b9bb48a5ffdb9d24eee119be69893 .

The fix boils down to the following:

  1. In case there is a Zip64 extra field entry created, uncompressed and compressed sizes are both written. Judging from the PKWARE zip spec (4.5.3) it should be this way in the local headers anyway.
  2. At entry closing now both the base compressed size and the compressed size in the Zip64 is updated.

It seems to be working, but it is sloppy, affects streaming mode and I am not sure if it is save for other cases (like archive splitting).

Could you have a look at this?

nmoinvaz commented 3 years ago

When adding a file entry, set the uncompressed_size field of file_info and it will pre-determine if it needs zip64 or not.

nmoinvaz commented 3 years ago
  1. PKWARE zip spec 4.5.3 says "but the fields MUST only appear if the corresponding Local or Central directory record field is set to 0xFFFF or 0xFFFFFFFF". So only 64-bit compressed_size will appear if local header's uncompressed size value is -1, and only 64-bit uncompressed size will appear if local header's uncompressed size is -1. In fact, I think the code used to be such that it would automatically force all of them to be 64-bit, but now it does it according to the spec and only writes them IF their associated value in the local header is -1.
Eswcvlad commented 3 years ago

In my case I need to do without streaming/data descriptor and just have the full sizes in the local header.

I am passing the uncompressed size and it detects it properly. The issue is that if the compressed size will also not fit in 32-bit in the end this won't be visible in the local header: it will just have 0xFFFFFFFF in the base field and no corresponding entry in the Zip64 extra field.

For the spec there is later "This entry in the Local header MUST include BOTH original and compressed file size fields.", so I would assume it means having both 0xFFFFFFFF and put both in the extra field, even if they fit. Though the spec somewhat contradicts itself on Zip64, so I am not sure whether my assumption is correct or not...

nmoinvaz commented 3 years ago

@Eswcvlad please review the change I have made.

Eswcvlad commented 3 years ago

@nmoinvaz I am guessing the stream_buf changes is the reason I wasn't able to read from that stream before?

Tested locally on my test suite and I've done some bugfixes here: https://github.com/Eswcvlad/minizip/commit/b322d3c904ae5a10d2c5253643a73f2d196855f3 . With this it seems to work properly on win/mac/linux. It also seems to be fine with data descriptor enabled too. The disk_offset change most likely is not required, but I've made it anyway as the entry in the local header confused zipdetails.

Additionally two other related tweaks, that might be useful: https://github.com/Eswcvlad/minizip/commit/e2e375dd1f2b56cc7529e0f3044226180f1e2a7b https://github.com/Eswcvlad/minizip/commit/5093eb6f16a5c48dfce94288403d488b1f98e8b5

nmoinvaz commented 3 years ago

I think with regard to _diskoffset the part I was previously missing was the !local check here: https://github.com/nmoinvaz/minizip/blob/123c49e94edd144d6ec2d1fc29fc2f4ee8de6a07/mz_zip.c#L794-L795

nmoinvaz commented 3 years ago

@Eswcvlad Yes, it is the reason. I'm hoping the change I made works. Originally the buffered stream was only meant for either read or write (not both). Time will tell.

nmoinvaz commented 3 years ago

All the changes should be made now.

Eswcvlad commented 3 years ago

@nmoinvaz the order of fields is still incorrect here: https://github.com/nmoinvaz/minizip/blob/3afaad4d9ddd8bcc6af8c6f4c7e25478de703bc2/mz_zip.c#L2181-L2186

nmoinvaz commented 3 years ago

I have made a change to fix the ordering. However, I am not sure how I feel about the change to the posix stream to w+b. And perhaps wondering if there is a better way to do the zip64 offset calculation that doesn't require reading.

nmoinvaz commented 3 years ago

If we always write ZIP64 extension first in extrafield, then it will be possible to calculate fixed offset much easier. What do you think @Eswcvlad?

Eswcvlad commented 3 years ago

Seems good. I'll run my tests later today to recheck just in case.

Eswcvlad commented 3 years ago

https://github.com/nmoinvaz/minizip/commit/d183a4061f1f9302cc656373db8fe9d3e5322179 broke dir handling, as directories may appear after the 4 GB mark and would require Zip64 for the disk offset.

Otherwise seems to work fine.

nmoinvaz commented 3 years ago

Thanks. I have made a fix to that which will only turn it off zip64 for directory entries in local headers. Central directory headers should still have zip64 for directories due to the issue you mentioned.

Eswcvlad commented 3 years ago

I've tested the changes and it still has issues. Since in the end when we update the Zip64 data in the local header, we base it on the needs_zip64 at the central directory on entry close.

So, currently, for directories outside the 4 GB offset there won't be an entry in the local header, but there will be in the central header and the code for updating the local header will overwrite some other extra field or data, as there is no Zip64 extra field there. Somewhat opposite issue with the cushion: in that case it will instead just not update the entry in the local header in case cushion wasn't necessary.

Ideally it would be better to cache whether Zip64 was used in the local header or not (or just the offset to the Zip64 extra field at that point) and use that during the local header update on entry close. Or try to keep it so it is never the case, that there is no Zip64 in local header (i.e. entry open), but it is present in the central header (i.e. on entry close).