zip-rs / zip2

Zip implementation in Rust
Other
116 stars 40 forks source link

Large zip file cannot be read by Excel #100

Closed jmcnamara closed 6 months ago

jmcnamara commented 6 months ago

Thanks for taking over maintainership of zip.rs. This is a duplicate of an issue that I raised against the older version: https://github.com/zip-rs/zip-old/issues/388

I have a library call rust_xlsxwriter that uses zip.rs to create xlsx files.

The crate is reasonably well used and to date I haven't had any issues from Excel, either personally or from end users.

However, one user who is creating large (but not large_file large) 400MB files with lots of images reported an issue with the following file:

Bad file: basic-full.xlsx

Excel complains about this almost immediately when it tries to open it:

screenshot 1

I don't see any issues in the zipped XML data (the content) and if I unzip and rezip the file then Excel can open it:

unzip basic-full.xlsx -d basic-full

cd basic-full
find . -type f | xargs zip ../basic-rezipped.xlsx
cd ..

open basic-rezipped.xlsx

259741048-bbe3a64f-ac79-48aa-a7d3-af3027d14d0e

Good file: basic-rezip.xlsx

Also, a Microsoft Open XML SDK validation tool complains about the zip file when I try to open the bad file:

CleanShot 2023-08-11 at 21 14 43

So it looks to me like there is an issue with the zip container.

The bad file above was created with zip.rs v0.6.4 but I get the same result with v0.6.6. See the following for additional details: https://github.com/jmcnamara/rust_xlsxwriter/issues/51 Also, I'm second guessing that the issue is with large files. It could be related to the number of files in the zip: 87814.

The file was created with the following zip.rs options:

    pub(crate) fn new(writer: W) -> Packager<W> {
        let zip = zip::ZipWriter::new(writer);

        let zip_options = FileOptions::default()
            .compression_method(zip::CompressionMethod::Deflated)
            .unix_permissions(0o600)
            .last_modified_time(DateTime::default())
            .large_file(false);

        Packager { zip, zip_options }
    }

And the following dependency in Cargo.toml:

zip = {version = "0.6.4", default-features = false, features = ["deflate"]}

Turning on the large_file ZIP64 option doesn't help.

jmcnamara commented 6 months ago

Based on the analysis in https://github.com/zip-rs/zip-old/issues/388 the issue is that the DEFAULT_VERSION is 46 for bzip even if bzip isn't being used.

My proposed workaround was:

diff --git a/src/types.rs b/src/types.rs
index c3d0a45..e2f44d4 100644
--- a/src/types.rs
+++ b/src/types.rs
@@ -272,7 +272,10 @@ impl TryFrom<OffsetDateTime> for DateTime {
     }
 }

+#[cfg(feature = "bzip2")]
 pub const DEFAULT_VERSION: u8 = 46;
+#[cfg(not(feature = "bzip2"))]
+pub const DEFAULT_VERSION: u8 = 45;

 /// A type like `AtomicU64` except it implements `Clone` and has predefined
 /// ordering.

I can submit a PR for that if you think it is reasonable.

Pr0methean commented 6 months ago

Yes please; I'll have to look up what other software DEFAULT_VERSION affects. Thanks for migrating this issue and closing the original; it helps get the old repo ready for archiving.

Pr0methean commented 6 months ago

Ok, looks like we may need different version numbers according to the compression and encryption we're actually using for each file, not just what feature flags we're built with. See APPNOTE 4.4.3.2. For example, we'll need to set 51 for AES and 63 for LZMA.

Pr0methean commented 6 months ago

I've implemented this myself, with all the features for which APPNOTE 4.4.3.2 specifies a minimum version. (Not sure what the correct version is for Zstandard, because APPNOTE doesn't say.)

jmcnamara commented 6 months ago

Thanks for that. I'll rerun the test case for this later and let you know.

jmcnamara commented 5 months ago

Apologies, it took me a while to test this and unfortunately it doesn't fix the issue.

I'll go over the use case again for clarity. I maintain rust_xlsxwriter which is a crate for creating Excel xlsx files which are a collection of mostly xml files in a zip container. I use zip/zip2 for the zipping part of this. A user submitted a failing test case where they were trying to create an Excel file with ~67k images of Japanese Unicode characters. This exceeds the 64k zip file limit and requires zip64 support. zip.rs supports this correctly but uses a hardcoded value of 46 (Zip spec 4.6) where Excel expects to see 45.

If the follow DEFAULT_VERSION is 46:

https://github.com/zip-rs/zip2/blob/629623594d9681d50b1e82298d9d087bc04e83f1/src/write.rs#L1405-L1417

This causes the error shown above when Excel tries to load the file. If it is changed to 45 then everything works as expected.

In your patch c2fe207 you add enhanced logic to determine version_needed() but that isn't used in fn write_central_and_footer().

I tested with zip.rs v2.0.0 released today (well done).

Pr0methean commented 5 months ago

@jmcnamara I believe this will be fixed when e6b2290f70390e5df56d62d5efad4fa179268eb0 is released as part of version 2.1.0 tonight. Other commits that may be essential to the solution are cda4712153b629cbcdbeda87418c7a7319d27eb9, 0636bd74119e30e89399bd644f0e1239f419b6a0, f90bdf76b88a6218847b04d1813774f79251f41c (in rare corner cases) and 3afe549161cf769f845e50f8d8ec306bc76a8240. If I'm wrong, could you please reopen this issue and provide an example of a v2.1.0 or newer output that Excel fails to process?

jmcnamara commented 5 months ago

I believe this will be fixed when e6b2290 is released as part of version 2.1.0 tonight. Other commits that may be essential to the solution are cda4712, 0636bd7, f90bdf7 (in rare corner cases) and 3afe549.

@Pr0methean I don't think any of those fix the issue. The files that demonstrate this issue are too big (250MB) to upload to GitHub but there is an example in the first comment of this issue from the original reporters repo:

Bad file: basic-full.xlsx

The issue seems to be in the Central Directly End block of the file which for the file above is the following from zipdetails -vv:

CAA3A14 CAA3A17 0000004 50 4B 06 06 ZIP64 END CENTRAL DIR 06064B50 (101075792)
                                    RECORD
CAA3A18 CAA3A1F 0000008 2C 00 00 00 Size of record        000000000000002C (44)
                        00 00 00 00
CAA3A20 CAA3A20 0000001 2E          Created Zip Spec      2E (46) '4.6'
CAA3A21 CAA3A21 0000001 00          Created OS            00 (0) 'MS-DOS'
CAA3A22 CAA3A22 0000001 2E          Extract Zip Spec      2E (46) '4.6'
CAA3A23 CAA3A23 0000001 00          Extract OS            00 (0) 'MS-DOS'
...

Excel has an issue with the Extract Zip Spec 2E (46) '4.6' field. If I change that in a hex editor to 0x2D (Extract Zip Spec 2D (45) '4.5') the file will open (if the system has enough memory).

Note, the Created Zip Spec field doesn't have any effect on this.

jmcnamara commented 5 months ago

Update, maybe this is fixed on main. I was looking at a version of the code that was slightly out of date. Let me recheck...

jmcnamara commented 5 months ago

Sorry for the noise earlier. I reran with the latest main/v2.1.0 and it Excel can now open the file. For what it is worth here is the Central Directly End block:

CAF3AD7 0000004 50 4B 06 06 ZIP64 END CENTRAL DIR 06064B50
                            RECORD
CAF3ADB 0000008 2C 00 00 00 Size of record        000000000000002C
                00 00 00 00
CAF3AE3 0000001 2E          Created Zip Spec      2E '4.6'
CAF3AE4 0000001 00          Created OS            00 'MS-DOS'
CAF3AE5 0000001 14          Extract Zip Spec      14 '2.0'
CAF3AE6 0000001 00          Extract OS            00 'MS-DOS'
CAF3AE7 0000004 00 00 00 00 Number of this disk   00000000
CAF3AEB 0000004 00 00 00 00 Central Dir Disk no   00000000
CAF3AEF 0000008 06 57 01 00 Entries in this disk  0000000000015706
                00 00 00 00
CAF3AF7 0000008 06 57 01 00 Total Entries         0000000000015706
                00 00 00 00
CAF3AFF 0000008 A5 49 5C 00 Size of Central Dir   00000000005C49A5
                00 00 00 00
CAF3B07 0000008 32 F1 52 0C Offset to Central dir 000000000C52F132
                00 00 00 00

CAF3B0F 0000004 50 4B 06 07 ZIP64 END CENTRAL DIR 07064B50
                            LOCATOR
CAF3B13 0000004 00 00 00 00 Central Dir Disk no   00000000
CAF3B17 0000008 D7 3A AF 0C Offset to Central dir 000000000CAF3AD7
                00 00 00 00
CAF3B1F 0000004 01 00 00 00 Total no of Disks     00000001

CAF3B23 0000004 50 4B 05 06 END CENTRAL HEADER    06054B50
CAF3B27 0000002 00 00       Number of this disk   0000
CAF3B29 0000002 00 00       Central Dir Disk no   0000
CAF3B2B 0000002 FF FF       Entries in this disk  FFFF
CAF3B2D 0000002 FF FF       Total Entries         FFFF
CAF3B2F 0000004 A5 49 5C 00 Size of Central Dir   005C49A5
CAF3B33 0000004 32 F1 52 0C Offset to Central Dir 0C52F132
CAF3B37 0000002 00 00       Comment Length        0000

Separate question, in this case should the Extract Zip Spec flag be 45 since there are more that 64k files or does that rely on also setting SimpleFileOptions.large_file(true)?