wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

lazy array attributes #27

Open jreback opened 7 years ago

jreback commented 7 years ago

IIRC this is from the design docs, but I wanted to make an issue so we remember it. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.

e.g. imagine a pd.date_range(....., ...); then unique, monotonic, has_nulls are trivial to compute at creation time. Since this is currently an Index in pandas, it is immutable by definition.
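
For illustration, current pandas already exposes these three properties on the Index returned by pd.date_range. This is only a small sketch against today's pandas, not the pandas 2.0 design; the start date, periods and freq arguments are arbitrary values picked for the example.

```python
import pandas as pd

# Arbitrary example arguments; any regular range behaves the same way.
idx = pd.date_range("2016-01-01", periods=5, freq="D")

# A regular range has no duplicates, is sorted, and contains no NaT,
# so all three answers could in principle be known at creation time.
print(idx.is_unique)
print(idx.is_monotonic_increasing)
print(idx.hasnans)

# An Index is immutable: item assignment raises TypeError.
try:
    idx[0] = pd.Timestamp("2000-01-01")
    mutated = True
except TypeError:
    mutated = False
```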

xref https://github.com/pydata/pandas/issues/12272, https://github.com/pydata/pandas/issues/14266

chris-b1 commented 7 years ago

API question - what does it look like to opt-in to one of these checks? As a specific example, I've used this "optimization" a few times to speed up merges on a monotonic column.

a.merge(b, on='sorted_col')

# takes advantage of monotonicity
a.set_index('sorted_col').join(b.set_index('sorted_col'))

What should that look like? Could be something like this, although maybe it should be even more hidden as an "advanced API" to avoid too many parameters on basic functions?

a.merge(b, on='sorted_col', check_monotonicity=True)

check_monotonicity= {'infer' | True | False}
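
To make the question concrete, here is a rough sketch of how the three options could behave, written as a standalone helper on top of current pandas. The name merge_on_sorted and its check_monotonicity parameter are hypothetical and not part of pandas; the sketch also assumes the non-key columns of the two frames do not share names, since DataFrame.join would otherwise need suffixes.

```python
import pandas as pd


def merge_on_sorted(a, b, on, how="inner", check_monotonicity="infer"):
    """Hypothetical helper, not a pandas API.

    check_monotonicity:
      "infer" - check both key columns, use the index join only if sorted
      True    - require sorted keys, raise ValueError otherwise
      False   - skip the check and always use the plain merge
    """
    if check_monotonicity not in ("infer", True, False):
        raise ValueError("check_monotonicity must be 'infer', True or False")

    if check_monotonicity is False:
        return a.merge(b, on=on, how=how)

    both_sorted = (
        a[on].is_monotonic_increasing and b[on].is_monotonic_increasing
    )
    if both_sorted:
        # Same trick as above: join on the sorted key used as the index.
        joined = a.set_index(on).join(b.set_index(on), how=how)
        return joined.reset_index()
    if check_monotonicity is True:
        raise ValueError(f"column {on!r} is not monotonic in both frames")
    return a.merge(b, on=on, how=how)


a = pd.DataFrame({"sorted_col": [1, 2, 3, 4], "x": [10, 20, 30, 40]})
b = pd.DataFrame({"sorted_col": [2, 3, 5], "y": [0.2, 0.3, 0.5]})

fast = merge_on_sorted(a, b, on="sorted_col")   # takes the index-join path
plain = a.merge(b, on="sorted_col")             # ordinary merge
```

Note that merge defaults to an inner join while join defaults to a left join, so the helper passes how explicitly to keep the two paths comparable.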

wesm commented 7 years ago

Things like monotonicity are so cheap to check and provide such significant performance benefits when they are known, that I would support always checking when it may be advantageous.

These attributes can be cached and invalidated whenever the array is mutated (we'd have to have a "dirty" flag to indicate that any cached array statistics need to be recomputed).
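
A minimal pure-Python sketch of the cache-plus-dirty-flag idea follows. The class CachedStatsArray and its internals are made up for illustration and do not reflect any actual pandas 2.0 code.

```python
class CachedStatsArray:
    """Toy array wrapper; hypothetical, not a pandas 2.0 design."""

    def __init__(self, values):
        self._values = list(values)
        self._stats = {}      # cache of lazily computed attributes
        self._dirty = False   # set on mutation, cleared on the next read

    def __setitem__(self, i, value):
        self._values[i] = value
        self._dirty = True    # cached statistics can no longer be trusted

    def _cached(self, name, compute):
        if self._dirty:
            self._stats.clear()
            self._dirty = False
        if name not in self._stats:
            self._stats[name] = compute(self._values)
        return self._stats[name]

    @property
    def is_monotonic_increasing(self):
        return self._cached(
            "is_monotonic_increasing",
            lambda v: all(x <= y for x, y in zip(v, v[1:])),
        )

    @property
    def is_unique(self):
        return self._cached("is_unique", lambda v: len(set(v)) == len(v))


arr = CachedStatsArray([1, 2, 3])
before = arr.is_monotonic_increasing   # computed once, then cached
arr[0] = 5                             # mutation marks the cache dirty
after = arr.is_monotonic_increasing    # recomputed on the next access
```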

llllllllll commented 7 years ago

Regarding immutability: What should happen if a user creates a series from an immutable array, and then later sets the array to mutable and mutates it? I think a valid answer is "don't do that", but it should be explicitly defined. If that should be supported behavior, you could forward immutability checks down to the underlying storage's check each time. The small indirection shouldn't be too expensive, but idk if you can cache that.
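
As a rough sketch of what forwarding the check could look like, using NumPy's writeable flag as a stand-in for the storage's immutability check (the MonotonicCache class is hypothetical):

```python
import numpy as np


class MonotonicCache:
    """Hypothetical wrapper: trusts its cache only while storage is read-only."""

    def __init__(self, storage):
        self._storage = storage
        self._cached = None

    @property
    def is_monotonic_increasing(self):
        # Forward the immutability check to the storage on every access.
        if self._storage.flags.writeable:
            self._cached = None  # storage may have changed; drop the cache
            return bool(np.all(np.diff(self._storage) >= 0))
        if self._cached is None:
            self._cached = bool(np.all(np.diff(self._storage) >= 0))
        return self._cached


values = np.array([1, 2, 3])
values.setflags(write=False)            # "immutable" at creation time
wrapped = MonotonicCache(values)
first = wrapped.is_monotonic_increasing

values.setflags(write=True)             # user flips it back to mutable
values[0] = 10                          # and mutates it
second = wrapped.is_monotonic_increasing
```

This also shows why caching is awkward here: if the flag were flipped to writeable, the data mutated, and the flag flipped back between two accesses, this sketch would not notice and would return a stale value.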

wesm commented 7 years ago

@llllllllll when you create a pandas.Series from a pandas.Array you are actually obtaining a view on that array, so if the source array mutates itself, it triggers copy-on-write (because it observes that its use count is > 1). So this will be a non-issue.
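
A toy illustration of copy-on-write keyed on a use count, in pure Python. The Buffer and Array classes here are invented for the sketch and are not the pandas 2.0 implementation.

```python
class Buffer:
    """Shared storage with an explicit use count (hypothetical)."""

    def __init__(self, data):
        self.data = list(data)   # always take a private copy of the input
        self.use_count = 0


class Array:
    """Toy array that copies its buffer before writing if it is shared."""

    def __init__(self, buffer):
        self._buf = buffer
        buffer.use_count += 1

    def view(self):
        # A view shares the same buffer and bumps its use count.
        return Array(self._buf)

    def __getitem__(self, i):
        return self._buf.data[i]

    def __setitem__(self, i, value):
        if self._buf.use_count > 1:
            # Someone else holds this buffer: detach and write to a copy.
            self._buf.use_count -= 1
            self._buf = Buffer(self._buf.data)
            self._buf.use_count = 1
        self._buf.data[i] = value


source = Array(Buffer([1, 2, 3]))
series_view = source.view()   # stands in for a Series built on the array
source[0] = 99                # use count is 2, so this triggers a copy
```

After the write, source sees the new value while series_view still sees the original data, so any statistics cached on the view stay valid.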