yongtang / hadoop-xz

XZ (LZMA/LZMA2) Codec for Apache Hadoop
Apache License 2.0

Cannot read compressed files using multithreading #9

Open ptallada opened 8 years ago

ptallada commented 8 years ago

A sample CSV file:

1,23.9
2,5.6

compressed using hadoop-xz yields:

$ hexdump -C out/000000_0.xz 
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

The same file, compressed manually using a single thread, yields an identical result:

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

But compressing it using multiple threads yields a somewhat different file that hadoop-xz is unable to read; it fails with:

java.io.IOException: XZ Stream Footer is corrupt

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -T0 -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 04 c0 11 0d  |.7zXZ......F....|
00000010  21 01 16 00 00 00 00 00  00 00 00 00 88 88 cd 68  |!..............h|
00000020  01 00 0c 31 2c 32 33 2e  39 0a 32 2c 35 2e 36 0a  |...1,23.9.2,5.6.|
00000030  00 00 00 00 dc 43 5b 17  b5 3f 3f e0 00 01 2d 0d  |.....C[..??...-.|
00000040  79 93 1d 7e 1f b6 f3 7d  01 00 00 00 00 04 59 5a  |y..~...}......YZ|
00000050
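Interestingly, the multithreaded file itself appears to be well-formed: the stock single-threaded decoder from org.tukaani:xz (the same library hadoop-xz builds on) reads it without complaint. A minimal sanity-check sketch, assuming the multithreaded output above was saved as 1.csv.xz (the class name is hypothetical):

    import java.io.FileInputStream;
    import org.tukaani.xz.XZInputStream;

    // Sketch: decode the multithreaded .xz file with the plain
    // single-threaded XZInputStream and print the recovered CSV.
    public class ReadMtXz {
        public static void main(String[] args) throws Exception {
            try (XZInputStream in = new XZInputStream(new FileInputStream("1.csv.xz"))) {
                System.out.write(in.readAllBytes());
                System.out.flush();
            }
        }
    }

If this prints the two CSV rows back, the "corrupt" error comes from hadoop-xz's own stream handling rather than from an actually corrupt file.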
trixpan commented 8 years ago

:+1:

I've seen this issue as well.

Worth noting that multithreaded compression has been implemented in xz >= 5.2.

shawjef3 commented 6 years ago

Forcing a newer version of the org.tukaani xz artifact (https://search.maven.org/search?q=g:org.tukaani%20AND%20a:xz&core=gav) does not fix this.

nikita-volkov commented 6 years ago

Seems like this stems from the XZ library under the hood. From its official page:

Single-threaded streamed compression and decompression and random access decompression have been fully implemented. Threading is planned but it is unknown when it will be implemented.

zanglang commented 5 years ago

Glancing at the code and at https://tukaani.org/xz/xz-file-format-1.0.4.txt, I suspect this is not related to multithreaded decompression, but rather to bad handling in io.sensesecure.hadoop.xz.XZSplitCompressionInputStream.nextStreamOffset() when the optional "Compressed Size" field is present in the initial block header. At first glance, xz only includes this field when multithreading is enabled ('80' in the first block), and the code takes a different path when the size is not found.

$ xz -1e -T2 -k /tmp/allmsg
$ head -c16 /tmp/allmsg.xz | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 04 c0 80 9c  |.7zXZ......F....|
$ xz -1e -T1 -k /tmp/allmsg
$ head -c16 /tmp/allmsg.xz | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
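For reference, here is a small probe (a sketch; the class name is hypothetical, and the byte offsets and flag bits come from the format spec linked above, not from hadoop-xz). Per section 3.1.2 of the spec, the 12-byte stream header is followed by the first block header, whose second byte holds the block flags: bit 0x40 marks an optional Compressed Size field and bit 0x80 an optional Uncompressed Size field.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Sketch: read past the 12-byte stream header (magic + stream flags +
    // CRC32) and inspect the first block header's flags byte.
    public class BlockFlagsProbe {
        public static void main(String[] args) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
                byte[] head = new byte[14];
                in.readFully(head); // 12-byte stream header + 2 block header bytes
                int flags = head[13] & 0xff;
                System.out.printf("block flags: 0x%02x%n", flags);
                System.out.println("Compressed Size present:   " + ((flags & 0x40) != 0));
                System.out.println("Uncompressed Size present: " + ((flags & 0x80) != 0));
            }
        }
    }

Run against the dumps above, this should report 0x00 for the -T1 file and 0xc0 (both optional size fields present) for the -T2 one, which matches exactly where the two hexdumps start to diverge.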

This chunk looks pretty suspicious:

    // Stream Footer
    inData.readFully(streamFooterBuf);
    if (streamFooterBuf[10] == XZ.FOOTER_MAGIC[0] && streamFooterBuf[11] == XZ.FOOTER_MAGIC[1] && DecoderUtil.isCRC32Valid(streamFooterBuf, 4, 6, 0)) {
        throw new IOException("XZ Stream Footer is corrupt");
    }

    long streamOffset = ((Seekable) seekableIn).getPos();
    ((Seekable) seekableIn).seek(offset);
    return streamOffset;
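If the intent of that check was to reject a corrupt footer, the condition reads as inverted: it throws precisely when the footer magic matches and the CRC32 verifies. A sketch of what it presumably meant to do (same identifiers as the quoted code; a guess at the intent, not a tested patch):

    // Sketch: throw only when the footer magic does NOT match or the CRC32
    // over bytes 4-9 (backward size + stream flags) does NOT verify.
    if (!(streamFooterBuf[10] == XZ.FOOTER_MAGIC[0]
            && streamFooterBuf[11] == XZ.FOOTER_MAGIC[1]
            && DecoderUtil.isCRC32Valid(streamFooterBuf, 4, 6, 0))) {
        throw new IOException("XZ Stream Footer is corrupt");
    }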