wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

DEV: Reusing low level constructs in libarrow / Apache Arrow #44

Closed wesm closed 8 years ago

wesm commented 8 years ago

As I've been prototyping, I've copied over a bunch of C++ code from Arrow (https://github.com/apache/arrow). I'm not sure that maintaining near-clones of the same code in two places makes sense (see @xhochy's comment here: https://github.com/pandas-dev/pandas2/commit/b982d96eb1c4c5d2c38a95635944f4bcf7e04de1#commitcomment-19406430).

The code in question is:

Sharing this code means adding libarrow as a build and runtime dependency. If this causes problems in some way, we can absorb the bits of the library that are being used in pandas. We should definitely set up C++ using aliases so that we are not referring to the arrow:: namespace directly in the code for these low-level bits.
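To make the aliasing idea concrete, here is a rough sketch of what a single alias header could look like. The arrow::Buffer class below is a tiny stand-in defined inline so the snippet compiles without libarrow; it is not the real Arrow class, and the names BufferPtr and MakeZeroedBuffer are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the real library so this sketch builds without libarrow.
// This is NOT the actual Arrow class, only a placeholder with a similar role.
namespace arrow {

class Buffer {
 public:
  explicit Buffer(std::vector<uint8_t> data) : data_(std::move(data)) {}
  int64_t size() const { return static_cast<int64_t>(data_.size()); }
  const uint8_t* data() const { return data_.data(); }

 private:
  std::vector<uint8_t> data_;
};

}  // namespace arrow

namespace pandas {

// The only place that names arrow:: directly. If the dependency is later
// absorbed into pandas, only these aliases would need to change.
using Buffer = arrow::Buffer;
using BufferPtr = std::shared_ptr<Buffer>;

// Hypothetical helper written purely against the pandas-side aliases.
inline BufferPtr MakeZeroedBuffer(int64_t nbytes) {
  return std::make_shared<Buffer>(
      std::vector<uint8_t>(static_cast<std::size_t>(nbytes), uint8_t{0}));
}

}  // namespace pandas
```

Code elsewhere in pandas would then spell the type as pandas::Buffer (or just Buffer inside the pandas namespace), so swapping the implementation between a shared libarrow and a vendored copy stays a one-file change.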

Later, we can also potentially take advantage of arrow::io, a small IO subsystem for dealing with files, memory maps, etc. This may be useful for revamping the CSV reader.

When we look at adding nested data types to pandas, or even a new string array type, we may want to consider using the Arrow memory layout, so having this in the build toolchain may make life easier in a number of ways.
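For reference, the Arrow layout for variable-length strings is roughly: one contiguous data buffer holding all the bytes, an offsets buffer with one more entry than there are values, and a validity bitmap for nulls. The sketch below mimics that shape with standard containers only. StringColumn and its methods are hypothetical names, and std::vector<bool> stands in for the packed bitmap; none of this is Arrow's actual API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container mirroring an offsets + data + validity layout.
struct StringColumn {
  std::vector<int32_t> offsets{0};  // length is always number of values + 1
  std::string data;                 // bytes of all values, stored back to back
  std::vector<bool> valid;          // stand-in for a packed validity bitmap

  void Append(const std::string& value) {
    data += value;
    offsets.push_back(static_cast<int32_t>(data.size()));
    valid.push_back(true);
  }

  // A null occupies no bytes: its start and end offsets are equal.
  void AppendNull() {
    offsets.push_back(static_cast<int32_t>(data.size()));
    valid.push_back(false);
  }

  std::size_t length() const { return valid.size(); }
  bool IsNull(std::size_t i) const { return !valid[i]; }

  // Value i spans the bytes [offsets[i], offsets[i + 1]) of the data buffer.
  std::string Value(std::size_t i) const {
    return data.substr(static_cast<std::size_t>(offsets[i]),
                       static_cast<std::size_t>(offsets[i + 1] - offsets[i]));
  }
};
```

The appeal of this shape is that a whole column lives in a few flat buffers instead of one heap object per string, which is part of why sharing the layout with Arrow could pay off for a new string array type.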