ptallada opened this issue 8 years ago
:+1:
I've seen this issue as well.
Worth noting that multi-threaded compression has been implemented in xz since version 5.2.
Forcing a newer version of the org.tukaani:xz dependency (https://search.maven.org/search?q=g:org.tukaani%20AND%20a:xz&core=gav) does not fix this.
Seems like this stems from the XZ library under the hood. From its official page:
Single-threaded streamed compression and decompression and random access decompression have been fully implemented. Threading is planned but it is unknown when it will be implemented.
Glancing at the code and at https://tukaani.org/xz/xz-file-format-1.0.4.txt, I suspect this is not related to multithreaded decompression, but to bad handling in io.sensesecure.hadoop.xz.XZSplitCompressionInputStream.nextStreamOffset() when the optional "Compressed Size" field is present in the initial Block Header. At first glance, xz only includes it when multithreading is enabled (note the '80' in the first Block Header below), and there is a different code path when the size is not found.
$ xz -1e -T2 -k /tmp/allmsg
$ head -c16 /tmp/allmsg.xz | hexdump -C
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 04 c0 80 9c |.7zXZ......F....|
$ xz -1e -T1 -k /tmp/allmsg
$ head -c16 /tmp/allmsg.xz | hexdump -C
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 |.7zXZ......F..!.|
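To make the difference explicit: per the file-format spec, the byte right after the Block Header Size field is the Block Flags field, where bit 0x40 marks a present "Compressed Size" and bit 0x80 a present "Uncompressed Size". In the -T1 dump above the flags byte (offset 13) is 0x00, so no optional sizes; in the -T2 dump it is 0xc0, so both are present. Below is a minimal probe I put together to check this against the spec; it is my own sketch, not hadoop-xz code, and BlockHeaderProbe is just a name I made up:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Minimal probe for the first Block Header of an .xz file, written against
// https://tukaani.org/xz/xz-file-format-1.0.4.txt (not hadoop-xz code).
public class BlockHeaderProbe {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.readFully(new byte[12]);            // skip the 12-byte Stream Header
            int sizeByte = in.readUnsignedByte();  // Block Header Size field
            if (sizeByte == 0x00) {                // 0x00 means the Index follows: no blocks
                System.out.println("empty stream: no Block Header");
                return;
            }
            int flags = in.readUnsignedByte();     // Block Flags field
            System.out.println("Block Header size: " + (sizeByte + 1) * 4 + " bytes");
            System.out.println("Compressed Size present:   " + ((flags & 0x40) != 0));
            System.out.println("Uncompressed Size present: " + ((flags & 0x80) != 0));
        }
    }
}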
This chunk looks pretty suspicious: as quoted, it throws "XZ Stream Footer is corrupt" exactly when the footer magic matches and the CRC32 is valid, i.e. when the footer is fine:
// Stream Footer
inData.readFully(streamFooterBuf);
if (streamFooterBuf[10] == XZ.FOOTER_MAGIC[0] && streamFooterBuf[11] == XZ.FOOTER_MAGIC[1] && DecoderUtil.isCRC32Valid(streamFooterBuf, 4, 6, 0)) {
    throw new IOException("XZ Stream Footer is corrupt");
}
long streamOffset = ((Seekable) seekableIn).getPos();
((Seekable) seekableIn).seek(offset);
return streamOffset;
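For contrast, here is a sketch of what the check presumably intends: throw only when the footer does not validate. This is my guess at the intended logic (possibly a negation lost in quoting), not verified against the repository:

// Presumed intent (note the added negation): reject only when the footer
// magic or the CRC32 check fails.
if (!(streamFooterBuf[10] == XZ.FOOTER_MAGIC[0]
        && streamFooterBuf[11] == XZ.FOOTER_MAGIC[1]
        && DecoderUtil.isCRC32Valid(streamFooterBuf, 4, 6, 0))) {
    throw new IOException("XZ Stream Footer is corrupt");
}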
A sample CSV file:
compressed using hadoop-xz yields:
The same file, compressed manually using a single core, yields the same result.
But compressing it using multiple threads yields a somewhat different file that hadoop-xz is unable to read, failing with a: