zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Rethinking Zarr's core dependencies #2391

Open jhamman opened 4 days ago

jhamman commented 4 days ago

I'd like to open the conversation about what Zarr's core dependencies are for 3.0. Currently, this looks like:

https://github.com/zarr-developers/zarr-python/blob/11312534ebe683d73cbbcc2da9e88933cb00cc14/pyproject.toml#L25-L34

Some of these are not used anymore (asciitree and fasteners) so those can safely go.

Then there is fsspec and crc32c. These are only needed for the RemoteStore and ShardingCodec, respectively. What do we think about making these optional?

One proposed diff in our dependencies would look something like:

 dependencies = [
-    'asciitree',
     'numpy>=1.25',
-    'fasteners',
-    'numcodecs>=0.10.2',
-    'fsspec>2024',
-    'crc32c',
+    'numcodecs>=0.12',
     'typing_extensions',
     'donfig',
 ]

 [project.optional-dependencies]
+remote = [
+    "fsspec",
+]
+sharding = [
+    "crc32c",
+]
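
For illustration, a minimal sketch of how code could guard an import once a package moves into an optional group. The helper name `require_optional` is hypothetical and not part of zarr-python; the extra names (`remote`, `sharding`) are taken from the proposed diff above, and the `zarr[...]` install hint assumes those extras are published under that name.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with a hint about which extra to install.

    Hypothetical helper for illustration only; not zarr-python API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with an actionable message that names the optional group.
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with `pip install 'zarr[{extra}]'`"
        ) from err
```

With this pattern, a store or codec would call something like `require_optional("fsspec", "remote")` only when it is first used, so a plain install still imports cleanly.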

Notes:

d-v-b commented 4 days ago

👍 this seems good to me.

dstansby commented 3 days ago

I think sharding is a big enough part of what zarr v3 promises, that it's worth having crc32c as part of the default dependencies. Looking at their files on PyPI the package is very light (~40kB), and it doesn't have any other requirements.

fsspec is also small (200kB), so I wonder if it's worth keeping as a default too, so users don't have to jump through extra hoops to open remote arrays? Given that a large use case of zarr is as a format for large data, users will often be accessing it remotely.

What are the reasons for removing these? Definitely open to considering it, but given they're lightweight deps at the moment I'm thinking we should keep them as default.

d-v-b commented 3 days ago

> I think sharding is a big enough part of what zarr v3 promises, that it's worth having crc32c as part of the default dependencies. Looking at their files on PyPI the package is very light (~40kB), and it doesn't have any other requirements.

Is there a reason why we shouldn't put sharding in numcodecs? then the crc32c dependency would live there.

dstansby commented 3 days ago

👍 for that

jhamman commented 3 days ago

Here's my thought on fsspec. While I agree that the package dependency is not particularly large, it also doesn't come with batteries included -- you still need s3fs, gcsfs, adlfs, etc. to use the RemoteStore. I imagine we're all aligned on keeping each of the individual implementations out of the required dependency tree. I guess my perspective is that if all of those are optional, and they all depend on fsspec, then we don't gain much by requiring fsspec.
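
To make the point concrete, here is a rough sketch that reports which of the backend packages named above are importable in the current environment. The function name `available_backends` and the protocol-to-package mapping are assumptions for illustration, not part of zarr-python or fsspec.

```python
from importlib.util import find_spec

# Assumed mapping from URL protocol to the package that implements it.
BACKENDS = {"s3": "s3fs", "gs": "gcsfs", "abfs": "adlfs"}


def available_backends(backends=BACKENDS):
    """Return {protocol: True/False} depending on whether each package is installed."""
    # find_spec returns None for a top-level module that cannot be found,
    # so nothing is actually imported here.
    return {proto: find_spec(pkg) is not None for proto, pkg in backends.items()}
```

Installing any one of these packages pulls in fsspec as its own dependency, which is why requiring fsspec separately adds little.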

@d-v-b and/or @dstansby - can one of you open an issue on crc32c in numcodecs?

dstansby commented 3 days ago

That makes sense to me on fsspec - would be good to add some docs if it's optional, I'll stick a request on https://github.com/zarr-developers/zarr-python/pull/2395.

I opened an issue for crc32c at https://github.com/zarr-developers/numcodecs/issues/610

normanrz commented 2 days ago

I also think that we should only drop crc32c as a core zarr dependency once it is part of numcodecs. It would suck if people had to install additional groups to be able to use sharding.