wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

Dataframe immutability option in pandas 2.x #75

Closed: bivald closed this 3 years ago

bivald commented 6 years ago

Hi,

I'm not sure of the proper way to give feedback on the design phase of pandas 2.x, so feel free to move this elsewhere. I know that immutable dataframes are out of scope for pandas 1.x (https://github.com/pandas-dev/pandas/issues/16567), but I would love to see this feature in pandas 2.x.

The background is that, with low-latency computation frameworks such as Dask Distributed (https://github.com/dask/distributed/), more people (myself included) are starting to view Dask+Pandas as a real-time query engine that rivals and surpasses most traditional databases in several use cases. However, because Dask relies on Pandas, the data it holds is not immutable: a helper function can accidentally alter the data and corrupt your in-memory storage.
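To illustrate the failure mode, here is a minimal sketch. The `add_vat` helper and the column names are made up for the example, and it uses plain pandas rather than a Dask cluster:

```python
import pandas as pd

# Hypothetical in-memory dataset shared between queries.
store = pd.DataFrame({"item": ["a", "b", "c"], "price": [10.0, 20.0, 30.0]})


def add_vat(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper meant to return prices including 25% VAT."""
    # Bug: this assigns into the caller's DataFrame instead of a copy.
    df["price"] = df["price"] * 1.25
    return df


before = store["price"].tolist()
result = add_vat(store)
after = store["price"].tolist()

# The helper returned the very same object, so the shared data changed.
mutated = result is store and before != after
print(mutated)
```

Nothing in the call site hints that `store` was modified, which is what makes this kind of bug easy to miss.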

More background on the use case is at https://stackoverflow.com/questions/50017443/read-only-pandas-dataset-in-dask-distributed

Right now the options are basically:

  1. Copy the dataset on each query, but copying 2 GB takes time even in RAM (roughly 1.4 seconds).
  2. Devise a way to hash a dataframe (using hash_pandas_object). This takes roughly 5 seconds on a 1-2 GB dataframe, which makes it too expensive to run after each query (otherwise you could use it to detect mutations and reset your in-memory data).
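Both options can be sketched roughly as follows. The function names (`fingerprint`, `run_query_on_copy`, `run_query_checked`, `bad_query`) are hypothetical, and the tiny frame only stands in for the real dataset; the timings above do not apply to it:

```python
import hashlib

import pandas as pd


def fingerprint(df: pd.DataFrame) -> str:
    """Digest of the per-row hashes (index included); column names are not covered."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


def run_query_on_copy(df, query):
    """Option 1: hand each query its own deep copy."""
    return query(df.copy(deep=True))


def run_query_checked(df, query, expected):
    """Option 2: run on the shared frame, then report whether it still matches."""
    result = query(df)
    return result, fingerprint(df) == expected


def bad_query(df):
    # Hypothetical query that mutates its input by mistake.
    df.loc[0, "price"] = -1.0
    return df["price"].sum()


store = pd.DataFrame({"item": ["a", "b"], "price": [10.0, 20.0]})
baseline = fingerprint(store)

# Option 1: the mutation lands on the copy, so the store is untouched.
run_query_on_copy(store, bad_query)
copy_kept_store_intact = fingerprint(store) == baseline

# Option 2: the mutation hits the store and the fingerprint check notices.
_, still_intact = run_query_checked(store, bad_query, baseline)
```

Option 1 pays the copy cost up front on every query; option 2 pays the hashing cost afterwards and only detects the damage, so the data still has to be reloaded.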

But I would love for Pandas 2.x to have an option for immutability. I know it's not a simple task though.

Regards, Niklas

wesm commented 6 years ago

Hi Niklas -- well, "pandas2" likely won't be called "pandas2", but the goal is to have copy-on-write data frames and lazy copying in the future, so that you can create a "shallow" copy of a DataFrame that does not copy any memory until a mutation occurs.
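As a rough illustration of the idea only (not how pandas does or will implement it), a toy copy-on-write column could look like this. `CowColumn` and its methods are hypothetical names built on a NumPy array:

```python
import numpy as np


class CowColumn:
    """Toy copy-on-write wrapper around a NumPy array (hypothetical)."""

    def __init__(self, data, _shared=None):
        self._data = data
        # One-element list counting how many wrappers share `_data`.
        self._shared = _shared if _shared is not None else [1]

    def shallow_copy(self):
        # No memory is copied here; both wrappers point at the same buffer.
        self._shared[0] += 1
        return CowColumn(self._data, self._shared)

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        if self._shared[0] > 1:
            # First write while shared: detach onto a private copy.
            self._shared[0] -= 1
            self._data = self._data.copy()
            self._shared = [1]
        self._data[i] = value

    def shares_memory_with(self, other):
        return np.shares_memory(self._data, other._data)


original = CowColumn(np.array([1.0, 2.0, 3.0]))
lazy = original.shallow_copy()
shared_before_write = lazy.shares_memory_with(original)

lazy[0] = 99.0  # triggers the actual copy
shared_after_write = lazy.shares_memory_with(original)
```

The shallow copy is free until the first write, at which point only the writer pays for a private buffer and the original stays unchanged.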

Note I've just created an organization to raise money to work on these problems (see https://ursalabs.org/tech/) -- I don't know how long it will take to see things built, but at the current rate it's likely to take some years.

bivald commented 3 years ago

Closing this due to age; feel free to reopen it if you want to.