wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

Paratext <-> Apache Arrow bridge #55

Open wesm opened 7 years ago

wesm commented 7 years ago

@deads, at some point in the next 6 months I would like to use the paratext codebase to emit native Arrow C++ array objects (including native categoricals, i.e. arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered reader the next-generation CSV reader for pandas. I've already spent a lot of time optimizing the Arrow->pandas code path, and in pandas 2.0 the overhead should drop to 0.

The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo and make that a dependency of the pyarrow library, similar to how we've already built libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
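
For readers unfamiliar with the term, here is a rough pure-Python sketch of what on-the-fly dictionary encoding means: each distinct string is assigned an integer code the first time it is seen, so a column ends up as a list of codes plus a dictionary of unique values, which is the general shape of data a dictionary-encoded (categorical) array holds. The function name `dictionary_encode` is hypothetical and is not taken from paratext or Arrow; a real reader would do this in C++ per parsed field.

```python
from typing import Dict, Iterable, List, Tuple


def dictionary_encode(values: Iterable[str]) -> Tuple[List[int], List[str]]:
    """Hypothetical helper: encode strings as integer codes plus unique values."""
    index_of: Dict[str, int] = {}   # value -> code assigned so far
    dictionary: List[str] = []      # code -> value, in first-seen order
    indices: List[int] = []         # one code per input value
    for value in values:
        code = index_of.get(value)
        if code is None:
            # First time this value appears: give it the next free code.
            code = len(dictionary)
            index_of[value] = code
            dictionary.append(value)
        indices.append(code)
    return indices, dictionary


indices, dictionary = dictionary_encode(["red", "blue", "red", "green", "blue"])
print(indices)     # [0, 1, 0, 2, 1]
print(dictionary)  # ['red', 'blue', 'green']
```

Because codes are assigned in a single pass, the encoding can happen while fields are being parsed rather than as a separate step afterwards.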

deads commented 7 years ago

Hi @wesm, this sounds like a very interesting idea. The paratext reader is still missing a lot of functionality that's available in pandas.read_csv, so I imagine it will take some work to flesh out the feature matrix before deprecating read_csv. Full DateTime support and reading arbitrary objects will take a lot of work to get the details right. The chunking features and usecols in pandas should be much easier. Calling Python functions in multi-threaded code is deadly, so a read_csv feature that takes in a pure Python function will be problematic. Let's discuss further in person.
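
As a point of reference for the two "easier" features, the sketch below shows chunked reading with column selection using only the Python standard library. It is an illustration of the behaviour, not paratext or pandas code: `read_csv_chunks` is a hypothetical name, and its `usecols` and `chunksize` parameters merely borrow the names of the corresponding pandas.read_csv arguments.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Sequence


def read_csv_chunks(
    lines: Iterable[str], usecols: Sequence[str], chunksize: int
) -> Iterator[List[Dict[str, str]]]:
    """Hypothetical helper: yield lists of up to `chunksize` rows,
    keeping only the columns named in `usecols`."""
    reader = csv.DictReader(lines)  # first line is treated as the header
    while True:
        # islice pulls at most `chunksize` rows from the shared reader.
        chunk = [
            {name: row[name] for name in usecols}
            for row in islice(reader, chunksize)
        ]
        if not chunk:
            return
        yield chunk


data = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"
chunks = list(read_csv_chunks(io.StringIO(data), usecols=["a", "c"], chunksize=2))
print(len(chunks))  # 2
print(chunks[1])    # [{'a': '7', 'c': '9'}]
```

Values stay as strings here; type inference and date parsing are left out on purpose, since those are the parts described above as harder to get right.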