wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

add support for .gz files #48

Open andytwigg opened 7 years ago

andytwigg commented 7 years ago

It would be nice to add support for opening .gz files. Ideally we could pass a file handle, e.g.:

import gzip, paratext
with gzip.open(f, 'rb') as fh:
  paratext.read(fh)

It seems the file is opened by the C code, though, so this may not be practical. Would it be easier to add gzip reading support directly to the C code?

deads commented 7 years ago

Thank you for the feature request and the suggestion. The way paratext is architected, using a Python file handle would require random access on the file. This is not easily achievable with the Lempel-Ziv algorithm on which gzip is based: some files use a fixed dictionary in the header, but this is not true of all files. One would need to do a first sequential pass over the file to build the dictionary at each chunk's start point. Then the threads are spawned, and each one decompresses its own chunk using its own reconstructed dictionary. We would welcome this contribution if someone wants to take a crack at it!
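
The two-pass idea described above can be sketched in pure Python using only the standard `zlib` module. This is an illustration under assumptions, not paratext code: the function names (`build_checkpoints`, `decompress_chunk`, `parallel_gunzip`) are hypothetical, the "dictionary" is taken to mean a full snapshot of the decompressor state (via `Decompress.copy()`), and the input is assumed to be a single-member gzip stream held in memory.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def build_checkpoints(data: bytes, step: int):
    """Sequential pass: snapshot the decompressor state every `step` compressed bytes."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    checkpoints = []
    for offset in range(0, len(data), step):
        # The copy captures everything needed to resume at `offset`,
        # including the sliding window of recently decompressed bytes.
        checkpoints.append((offset, d.copy()))
        d.decompress(data[offset:offset + step])  # output discarded in this pass
    return checkpoints


def decompress_chunk(data: bytes, offset: int, state, step: int) -> bytes:
    """Decompress one compressed chunk, starting from its saved state."""
    return state.decompress(data[offset:offset + step])


def parallel_gunzip(data: bytes, step: int = 1 << 16, workers: int = 4) -> bytes:
    """Decompress chunks independently on a thread pool and join them in order."""
    checkpoints = build_checkpoints(data, step)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda cp: decompress_chunk(data, cp[0], cp[1], step), checkpoints
        )
        return b"".join(parts)
```

Note that the first pass in this sketch already does a full decompression, so it only pays off if the checkpoints are saved and reused, or if the per-chunk work that follows (parsing, type inference) dominates the cost. A real implementation would presumably live in the C code and store a more compact index.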