zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Is Zarr an OLAP Database? #290

Open alxmrs opened 8 months ago

alxmrs commented 8 months ago

I've been doing some background research for a project I've been working on. I came across the definition of an OLAP DB and OLAP Cube, and I can't help but see the similarities to Zarr.

https://en.wikipedia.org/wiki/OLAP_cube

Consider the operations section of this Wikipedia page:

This makes me wonder if Zarr could be (mis)used as a traditional DB, say, to handle analytics and business use cases. Furthermore, maybe the literature around OLAP DBs could inspire improvements to Zarr as a format.

xref: https://github.com/alxmrs/xarray-sql/issues/47

d-v-b commented 8 months ago

I'm not too familiar with the OLAP conceptualization, but I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".
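To make the analogy concrete, here is a minimal sketch (not anything Zarr itself does) that flattens a tiny 2-D array into a table with index columns `i`, `j` and one `value` column, then expresses indexing, slicing, and a reduction as queries. The table name `arr` and the column names are made up for illustration, and SQLite stands in for "a table" only because it ships with Python.

```python
import itertools
import sqlite3

# A small 2 x 3 array stored as nested lists (row-major).
array = [[10, 11, 12],
         [20, 21, 22]]
shape = (2, 3)

# Table view: one row per element. The index tuple is spread over the
# columns i and j, and the element itself goes in the value column.
rows = [(i, j, array[i][j]) for i, j in itertools.product(*map(range, shape))]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE arr (i INTEGER, j INTEGER, value INTEGER, PRIMARY KEY (i, j))"
)
con.executemany("INSERT INTO arr VALUES (?, ?, ?)", rows)

# array[1][2] as a point query on the index columns.
point = con.execute(
    "SELECT value FROM arr WHERE i = 1 AND j = 2"
).fetchone()[0]

# The slice array[:, 1:3] as a range query, ordered to match row-major layout.
sliced = [v for (v,) in con.execute(
    "SELECT value FROM arr WHERE j BETWEEN 1 AND 2 ORDER BY i, j"
)]

# A reduction (sum over axis 0) as a GROUP BY on the remaining index.
col_sums = [s for (s,) in con.execute(
    "SELECT SUM(value) FROM arr GROUP BY j ORDER BY j"
)]

print(point, sliced, col_sums)
```

The table stores the indices explicitly, whereas an array library keeps only the values in a contiguous buffer and computes positions from the shape, which is where the "column-oriented" part of the analogy comes from.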

With this in mind, Zarr can be described as a "columnar database" with a few performance / storage optimizations for large columns, but it's not a database that competes with something like DuckDB, since Zarr doesn't have good support for variable-length types.

All that being said, if the world of databases is a superset of the world of N-dimensional arrays, then it's almost certainly the case that we can use tools from database theory / software to advance Zarr.

alxmrs commented 8 months ago

IIUC, OLAP is a way to represent ND arrays in an RDBMS s.t. a range of common analytic queries are performant. These center around the snowflake or star DB schema pattern. But, I’m a DB newcomer, so take this explanation with a grain of salt.
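As a rough illustration of the star schema idea, the sketch below stores a 3 x 2 (time x site) array as a fact table keyed by two small dimension tables, then runs a "slice" and a "roll-up" as SQL. Every table name, column name, and data value here is hypothetical, and SQLite is used only because it is in the Python standard library; a real OLAP system would look quite different.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_site (site_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_temperature (
    time_id INTEGER REFERENCES dim_time(time_id),
    site_id INTEGER REFERENCES dim_site(site_id),
    value REAL,
    PRIMARY KEY (time_id, site_id)
);
""")

# Dimension tables: labels (coordinates) for each axis of the cube.
con.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [
    (0, "2024-01-30", "2024-01"),
    (1, "2024-01-31", "2024-01"),
    (2, "2024-02-01", "2024-02"),
])
con.executemany("INSERT INTO dim_site VALUES (?, ?)", [(0, "north"), (1, "south")])

# Fact table: one row per cell of the 3 x 2 (time x site) array.
cube = [[1.0, 5.0],
        [3.0, 7.0],
        [4.0, 8.0]]
con.executemany(
    "INSERT INTO fact_temperature VALUES (?, ?, ?)",
    [(t, s, cube[t][s]) for t in range(3) for s in range(2)],
)

# Slice: fix one member of the site dimension, keep the whole time axis.
slice_north = [v for (v,) in con.execute("""
    SELECT f.value
    FROM fact_temperature f
    JOIN dim_site s ON s.site_id = f.site_id
    WHERE s.name = 'north'
    ORDER BY f.time_id
""")]

# Roll-up: aggregate days into months along the time dimension.
rollup = con.execute("""
    SELECT t.month, s.name, AVG(f.value)
    FROM fact_temperature f
    JOIN dim_time t ON t.time_id = f.time_id
    JOIN dim_site s ON s.site_id = f.site_id
    GROUP BY t.month, s.name
    ORDER BY t.month, s.name
""").fetchall()

print(slice_north)
print(rollup)
```

In array terms, the slice is `cube[:, 0]` and the roll-up is a grouped mean along the time axis, with the dimension tables playing the role of coordinate labels.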

I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".

Funny you mention this! I’m exploring something similar here (in a read-only capacity): https://github.com/alxmrs/xarray-sql

since Zarr doesn't have good support for variable-length types.

Can you explain this a bit more? Would this be like string or varchar support?

it's almost certainly the case that we can use tools from database theory / software to advance Zarr.

Top of mind for me here is streaming reads and writes. I think some sort of Rosetta Stone mapping Zarr features to features in the Postgres ecosystem would really highlight potential new areas of development.

Sent by email in reply to Davis Bennett's comment of Mon, Mar 18, 2024 at 3:16 PM.

d-v-b commented 8 months ago

since Zarr doesn't have good support for variable-length types.

Can you explain this a bit more? Would this be like string or varchar support?

Zarr is designed for numeric types that are a fixed size, e.g. uint8; there's some effort towards supporting variable-length strings in Zarr, but it's not a common use case. Compared to the types supported by a typical relational database, the numeric types Zarr focuses on are a tiny subset -- e.g., look at the types supported by Postgres.
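A small sketch of why fixed-size types are the easy case: with uint8 every element occupies the same number of bytes, so element `i` lives at byte offset `i * itemsize`, while variable-length strings need extra bookkeeping to locate each element. The offsets-table layout below is just one common approach, assumed here for illustration; it is not a description of how Zarr stores or plans to store strings, and the helper names are made up.

```python
import struct

# Fixed-size case: four uint8 values packed back to back, 1 byte each.
values = [3, 250, 7, 42]
buf = struct.pack(f"{len(values)}B", *values)
itemsize = struct.calcsize("B")

def get_fixed(buf, i):
    # Element i is found by arithmetic alone: offset = i * itemsize.
    return struct.unpack_from("B", buf, i * itemsize)[0]

# Variable-length case: UTF-8 strings of different byte lengths.
strings = ["zarr", "a", "", "olap cube"]
encoded = [s.encode("utf-8") for s in strings]
data = b"".join(encoded)

# Hypothetical layout: an offsets table with len(strings) + 1 entries,
# where string i occupies data[offsets[i]:offsets[i + 1]].
offsets = [0]
for e in encoded:
    offsets.append(offsets[-1] + len(e))

def get_var(data, offsets, i):
    # Element i cannot be located without consulting the offsets table.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

print(get_fixed(buf, 1))
print(get_var(data, offsets, 3))
print(offsets)
```

The fixed-size buffer can be split into equal chunks by element count without inspecting the data, which is harder to do once element sizes vary.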