alimanfoo opened 7 years ago
To follow up on this, here are a few potted thoughts. To be clear, I don’t have any significant bandwidth to work on this, and so I don’t expect anything to happen in the short term, just mulling some ideas.
An important question is, should zarr provide some kind of columnar table functionality. I.e., is it worth it, and who would use it, and what for? Especially given the traction of parquet, what would/could zarr do that cannot easily be done via parquet? Also, should zarr try to replicate the functionality of bcolz so it could serve as a replacement and so combine maintenance efforts? And there are plenty of other big players in columnar data storage, with way more time and money than we are likely to be able to muster. Some things I think are relatively special about zarr are (1) flexibility to layer on top of a range of storage systems, including distributed storage; (2) support for parallel read and write, with support for locking at various levels down to individual data chunks; (3) support for a range of fast compressors and filters; (4) relatively easy to get under the hood because of pure Python implementation. Are these features unique and/or valuable enough for columnar data applications to make the effort worthwhile? If so, what are the driving use cases?
Assuming it is worth doing, then what is the architecture? In particular, to what extent does the implementation build on the existing Group/Array classes? Could the new functionality be layered entirely on top of the existing architecture, or is any new fundamental architecture required?
In the work @jreback did (#84), a new top-level “Frame” class was created alongside Group and Array, with its own metadata key (‘.zframe’). An alternative “layered” approach would be to construct Frame as a sub-class of Group, where each child Array is treated as a column and is constrained to be 1-dimensional and have the same length.
The layered approach is appealing because it could allow maximal re-use of existing functionality, and potentially could avoid the need for any changes to the storage specification. However, it would be worth exploring the performance implications of a layered architecture.
In particular, I expect an important use case would be appending rows (or blocks of rows) to an existing table. If using a purely layered architecture, then each column (Array) would maintain its own metadata. Appending to a table would require updating the metadata for each Array to set a new shape. If a table had a large number of columns, and if some kind of network-based storage (e.g., S3) was being used, then the network latency involved in updating all these metadata files could impact on performance. Also, if any of these update operations failed, the table could be left in an inconsistent state, with different columns having different lengths.
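To make the latency and consistency point concrete, here is roughly what a purely layered append looks like against today's zarr-python (v2) API; the column names and chunk sizes are made up for illustration:

```python
import numpy as np
import zarr

# A purely layered "table": one group, one 1-D array per column.
table = zarr.group()
table.create_dataset('id', shape=(0,), chunks=(1000,), dtype='i8')
table.create_dataset('value', shape=(0,), chunks=(1000,), dtype='f8')

def append_rows(table, block):
    # Each append rewrites that column's own .zarray metadata to record the
    # new shape -- one metadata update (and one network round trip on a
    # store like S3) per column. A failure part-way through this loop leaves
    # columns with different lengths.
    for name, data in block.items():
        table[name].append(data)

append_rows(table, {'id': np.arange(5), 'value': np.random.random(5)})
```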
It might be possible to get the best of both worlds, i.e., take a layered approach, but introduce some workarounds for the metadata consistency issue above. For example, we could add a Frame class as a sub-class of Group, but use special attributes to store some additional metadata. In particular, the current length of the table could be set as the value of an attribute. It would probably also be necessary to store the list of column names as an attribute, to allow columns to be put in a particular order by the user. Then we could add a Column class as a sub-class of Array. In the Column class we could override how the shape of the array is determined, which would be looked up via the parent Frame, rather than from the .zarray metadata. We could also prevent resize or append operations on the Column class, to force these to be done via the parent Frame instead.
As well as the shape, it would probably also be worth constraining the chunk length to be the same across all columns, as aligning chunks would improve performance for any chunked computation. As with the shape, this could be done by setting the chunk length as a special attribute of the parent Frame, and the column class then overrides Array to look up chunk length via its parent, rather than from its own metadata.
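As a rough sketch of these semantics, the snippet below uses composition over a plain zarr group rather than the proposed Group/Array subclassing, so the per-column `.zarray` updates still happen under the hood; what it illustrates is where the authoritative metadata (length, chunk length, column order) would live. The `Frame` class and the `frame` attribute name are made up for illustration:

```python
import numpy as np
import zarr

class Frame:
    """Sketch only: table length, chunk length and column order are held in
    a single attribute on the parent group rather than per-column metadata."""

    def __init__(self, group, chunk_len=1000):
        self.group = group
        if 'frame' not in group.attrs:
            group.attrs['frame'] = {'columns': [], 'length': 0,
                                    'chunk_len': chunk_len}

    @property
    def _meta(self):
        return self.group.attrs['frame']

    def __len__(self):
        # Single source of truth for table length; in the full proposal,
        # Column.shape would be derived from this rather than from each
        # column's own .zarray metadata.
        return self._meta['length']

    def add_column(self, name, dtype):
        meta = self._meta
        self.group.create_dataset(name, shape=(meta['length'],),
                                  chunks=(meta['chunk_len'],), dtype=dtype)
        meta['columns'].append(name)
        self.group.attrs['frame'] = meta

    def append(self, block):
        # Write column data first, then commit the new table length once on
        # the parent frame. In the full subclassing proposal, Column arrays
        # would derive their shape from this attribute and skip their own
        # .zarray shape updates entirely.
        meta = self._meta
        lengths = {len(v) for v in block.values()}
        assert len(lengths) == 1, 'all columns must receive the same number of rows'
        for name in meta['columns']:
            self.group[name].append(block[name])
        meta['length'] += lengths.pop()
        self.group.attrs['frame'] = meta


frame = Frame(zarr.group())
frame.add_column('id', 'i8')
frame.add_column('value', 'f8')
frame.append({'id': np.arange(5), 'value': np.random.random(5)})
```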
One question is whether, under this design, each column would still have its own .zarray metadata key. Each column would need to be able to have its own dtype, compressor and/or filter configurations, and so those would need to be stored somewhere. These metadata fields would not be expected to change, so if each column did have its own .zarray, it would only need to be read infrequently and could be cached. To avoid confusion/inconsistency, if each column did have a .zarray, it would probably be a good idea to set the shape and chunks fields to null.
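To make that concrete, a column's .zarray under this design might look something like the following (shown as a Python literal; the convention of nulling shape and chunks is hypothetical, the other fields are the standard zarr v2 ones):

```python
# Hypothetical contents of a column's .zarray under the layered design:
# dtype, compressor and filters are stored per column as usual, while shape
# and chunks are nulled out because they are owned by the parent frame.
column_zarray = {
    "zarr_format": 2,
    "shape": None,       # looked up via the parent frame's attributes
    "chunks": None,      # likewise: chunk length is shared across columns
    "dtype": "<f8",
    "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
    "filters": None,
    "fill_value": 0.0,
    "order": "C",
}
```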
What would be the value of this kind of layered approach? It would mean that frames and columns would get recognised (as groups and arrays) by the existing (vanilla, non-frame-aware) machinery. At least that machinery would know they existed, and would not accidentally overwrite them if a user requested to create a vanilla group or array with the same name. But if columns were arrays with shape and chunks metadata set to null, then they could not be read by the vanilla array machinery.
The alternative would be to package up all metadata for the frame and all its columns into a single JSON resource, and store it under a special attribute. But this is almost the same as creating a new .zframe metadata key.
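For comparison, such a consolidated resource might look roughly like this (a hypothetical layout, not an existing zarr key; field names are illustrative only):

```python
# Hypothetical single metadata resource describing the frame and all of its
# columns, stored under one key (or one special attribute) on the group.
frame_meta = {
    "length": 1000000,
    "chunk_len": 10000,
    "columns": {
        "id":    {"dtype": "<i8", "compressor": {"id": "blosc"}},
        "value": {"dtype": "<f8", "compressor": {"id": "zstd"}},
    },
}
```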
It feels like I’m slightly going round in circles here, but wanted to explore some options.
I'd propose that this could likely be migrated to zarr-specs if not a more design-y or perhaps community repo.
cc: @ivirshup in case there's interest.
I'd totally agree with that @joshmoore.
I think it's pretty easy to specify a dataframe/table on top of zarr, and don't think it needs to be defined in the core zarr spec.
That said, there are a few ways table support could be improved.
But these don't need tables to be in the main spec.
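As a rough illustration of the point that a table can already be specified on top of plain zarr, here is one possible convention (a sketch only, not an agreed layout; the `table` attribute name and helper functions are made up):

```python
import pandas as pd
import zarr

def write_table(group, df, chunk_len=10000):
    # One 1-D zarr array per column, plus an attribute recording column order.
    for name in df.columns:
        group.create_dataset(name, data=df[name].values, chunks=(chunk_len,))
    group.attrs['table'] = {'columns': list(df.columns)}

def read_table(group):
    # Reassemble a pandas DataFrame in the recorded column order.
    cols = group.attrs['table']['columns']
    return pd.DataFrame({name: group[name][:] for name in cols})

g = zarr.group()
write_table(g, pd.DataFrame({'id': range(3), 'value': [0.1, 0.2, 0.3]}))
print(read_table(g))
```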
This issue is a placeholder for gathering thoughts and discussion on requirements for storage of columnar tables (a la bcolz). The idea is to explore how the functionalities currently available in zarr and bcolz might be brought together. This is not necessarily something that will happen in zarr, just an attempt to pull together thoughts and discussion at this stage.
Requirements
Below is a list of possible requirements. If you have any thoughts or comments, please add a comment below, including if you think any of the requirements should be non-requirements (e.g., because they can already be achieved by using zarr with dask).
- A `from_zarr()` function into dask.dataframe, which would allow out-of-core dask computations against the zarr-stored data.
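As a rough sketch of what such a convenience could look like, a group of equal-length, identically chunked column arrays can already be wired into dask by hand. The `dataframe_from_zarr` helper below is made up, not an existing dask or zarr API; the real pieces used are `dask.array.from_zarr`, `dask.dataframe.from_dask_array` and `dask.dataframe.concat`:

```python
import dask.array as da
import dask.dataframe as dd
import zarr

def dataframe_from_zarr(group, columns):
    # Hypothetical helper: wrap each 1-D column array as a lazy dask Series
    # (partitions follow the zarr chunks), then combine into a DataFrame.
    series = [dd.from_dask_array(da.from_zarr(group[name]), columns=name)
              for name in columns]
    return dd.concat(series, axis=1)

# Usage against a simple layered layout:
g = zarr.group()
g.array('id', data=list(range(10)), chunks=(5,))
g.array('value', data=[x / 10 for x in range(10)], chunks=(5,))
ddf = dataframe_from_zarr(g, ['id', 'value'])
print(ddf.value.mean().compute())
```

Aligned chunking across columns (as discussed above) is what makes the column partitions line up cleanly here.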