wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

DESIGN: Cheaper DataFrame.append #53

Open wesm opened 8 years ago

wesm commented 8 years ago

I'm thinking we can come up with a plan to yield a better .append implementation that defers stitching together arrays until it's actually needed for computations.

We can do this by having a virtual pandas::Table interface that consolidates fragmented columns only when they are requested. I will think some more about this.
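To make the deferred-consolidation idea concrete, here is a minimal sketch in Python rather than the C++ `pandas::Table` interface mentioned above. The `LazyColumn` class and its `consolidated` method are hypothetical names used only for illustration. It assumes a single numeric column and uses the standard-library `array` module as a stand-in for a contiguous buffer.

```python
from array import array


class LazyColumn:
    """Hypothetical column that keeps appended fragments separate until needed."""

    def __init__(self, typecode="d"):
        self._typecode = typecode
        self._chunks = []          # fragments, in append order
        self.consolidations = 0    # how many times fragments were stitched

    def append(self, values):
        # Store the new fragment; data already in the column is not copied.
        self._chunks.append(array(self._typecode, values))

    def __len__(self):
        # The length is known without consolidating anything.
        return sum(len(chunk) for chunk in self._chunks)

    def consolidated(self):
        # Stitch the fragments into one contiguous array only on request,
        # then keep the result so later calls do no extra work.
        if len(self._chunks) > 1:
            merged = array(self._typecode)
            for chunk in self._chunks:
                merged.extend(chunk)
            self._chunks = [merged]
            self.consolidations += 1
        elif not self._chunks:
            self._chunks = [array(self._typecode)]
        return self._chunks[0]


col = LazyColumn()
for i in range(5):
    col.append([float(i), float(i) + 0.5])

# Only this call pays for the concatenation, and it pays once.
total = sum(col.consolidated())
```

In this sketch five appends lead to a single consolidation, triggered by the first computation that needs contiguous data. Checking `len(col)` between appends never forces a copy.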

shoyer commented 8 years ago

The obvious alternative is to allow pandas objects to be backed by dynamic arrays. This is possible now that we require arrays to be 1D and contiguous.

This has the advantage of still using eager evaluation, so you don't need to build machinery for deferred evaluation. Also, you still get predictable performance, even if you inspect the array in between appends. I would guess that looking at DataFrames while they are being appended to piece by piece is pretty common, even if only to check the size.

The downside is that this wouldn't really work with the current interface, because such appends need to be in-place. Also, dynamic arrays reduce speed and increase memory requirements by small constant multiples.

Maybe it would make sense to deprecate DataFrame.append and instead make an alternative DynamicDataFrame (sub?)class that does an in-place append?
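A rough sketch of what an eagerly evaluated, in-place append on a dynamic array could look like. The `DynamicColumn` class is a hypothetical name, not an existing pandas API. It assumes one numeric column, uses the standard-library `array` module for the contiguous buffer, and doubles the capacity when the buffer fills up.

```python
from array import array


class DynamicColumn:
    """Hypothetical 1D contiguous column with an in-place append."""

    def __init__(self, typecode="d", capacity=4):
        self._buf = array(typecode, [0]) * capacity  # preallocated storage
        self._size = 0                               # number of valid elements

    def __len__(self):
        return self._size

    @property
    def capacity(self):
        return len(self._buf)

    def append(self, values):
        values = array(self._buf.typecode, values)
        needed = self._size + len(values)
        if needed > self.capacity:
            # Grow geometrically so repeated appends are amortized O(1) per element.
            new_capacity = max(needed, 2 * self.capacity)
            grown = array(self._buf.typecode, [0]) * new_capacity
            grown[: self._size] = self._buf[: self._size]
            self._buf = grown
        # Write the new values into the spare capacity, mutating in place.
        self._buf[self._size : needed] = values
        self._size = needed

    def values(self):
        # Copy of the valid prefix; the underlying buffer is always contiguous.
        return self._buf[: self._size]


col = DynamicColumn()
for i in range(10):
    col.append([i])
    current_size = len(col)  # inspecting between appends costs nothing extra
```

The unused tail of the buffer is the memory overhead mentioned above, and the occasional reallocation is the speed overhead. Both stay within a constant factor of the data size.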

wesm commented 8 years ago

We could definitely have a mutating append and write into resizable buffers (with growth factor 1.5 or 2). Something we can experiment with.
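As a back-of-the-envelope check on the growth factors, the following sketch counts reallocations and element copies for a sequence of single-row appends. It is a pure simulation with no real buffers. The function names are hypothetical, and the starting capacity of 1 is an arbitrary assumption.

```python
import math


def simulate_growable_appends(n_rows, growth):
    """Count reallocations and element copies for n_rows single-row appends."""
    capacity, size = 1, 0
    reallocations = copies = 0
    for _ in range(n_rows):
        if size == capacity:
            # Always grow by at least one slot, otherwise by the growth factor.
            capacity = max(capacity + 1, math.ceil(capacity * growth))
            reallocations += 1
            copies += size  # existing elements are moved to the new buffer
        size += 1
    return reallocations, copies, capacity


def simulate_copying_appends(n_rows):
    """Element copies when every append allocates a new array and copies the old one."""
    return sum(range(n_rows))  # the i-th append copies the i rows already present


n = 1000
naive_copies = simulate_copying_appends(n)
for growth in (1.5, 2.0):
    reallocations, copies, capacity = simulate_growable_appends(n, growth)
    print(f"growth={growth}: {reallocations} reallocations, "
          f"{copies} copies, final capacity {capacity}")
print(f"copy on every append: {naive_copies} copies")
```

A smaller growth factor wastes less spare capacity but reallocates more often. Either factor keeps the total number of copies linear in the number of rows, whereas copying on every append is quadratic.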

jreback commented 7 years ago

Related to this, I think enlargement via an indexer is a bit too magical / non-transparent / non-performant, unless you have a growable buffer.

Better to be explicit at a small loss of convenience in syntax.