yt-project / yt

Main yt repository
http://yt-project.org
Other
461 stars 276 forks source link

save_bitmasks and load_bitmasks require h5py #4690

Open brittonsmith opened 11 months ago

brittonsmith commented 11 months ago

Bug report

The save_bitmasks and load_bitmasks functions in yt/geometry/particle_oct_container.pyx import h5py. A bare install of yt (i.e., from pip) won't install h5py. My understanding is that we have not made h5py a hard dependency as it was only needed for specific frontends. However, it is now necessary for core functionality. Should we consider making this a hard dependency now? Either that, or perhaps another way to install that makes sure to capture this?

neutrinoceros commented 11 months ago
python -m pip install "yt[HDF5]"

is the supported way to install with h5py out of the box. Maybe we should advertise it more (currently I don't think it's documented at all).

My personal position hasn't changed since the last time this was discussed : I'd rather not make h5py a hard dependency because it's been (historically) very slow to add support new versions of Python and having them as a hard requirement would impact our own reactivity.

brittonsmith commented 11 months ago

Yeah, I had no idea about these installation options until I went looking in pyproject.toml. h5py is required for 20 out of 37 frontends and now implicitly for every particle frontend. This feels like a core dependency to me. Take all that out and I'm not sure how much functionality can be used without h5py. It may tie us down a bit, but it is effectively so already.

neutrinoceros commented 11 months ago

Some packages have a [recommended] extra target to deal with this kinda semi-obligatory dependencies (for instance astropy[recommended] gets you matplotlib). I think I would fancy having both h5py and ipywidget there (it's currently a hard dependency but is actually useless if you don't use IPython or Jupyter)

matthewturk commented 11 months ago

I recall being somewhat opposed to having h5py as a core dependency, but I think with the addition of the particle frontends requiring it we should upgrade it and make it required.

neutrinoceros commented 6 months ago

I should point out that this has been a very stressful year since the inception of numpy 2.0 (currently in beta), and it's already been pretty hard to stay on top of its breaking changes in the various libraries I maintain. I absolutely don't mean to blame h5py maintainers, who are doing the community an immense service with very little resources, but I observe that in the current state of the library, any package that depends on h5py cannot[^1] be tested against numpy dev, because h5py nightlies are lagging behind and are currently ABI incompatible with numpy 2.0. If we had a hard dependency on h5py, most of the work I did in to get ahead of user-visible failures in unyt / ewah-bool-utils / yt this past few months wouldn't have been possible, and it'd be much harder to solve the very same problems now just because there would be so many of them at once.

From my standpoint, making h5py a hard dep brings very little gain, and would come at the very high cost of making my work almost impossible. So... I don't want to play the "it's me or it" card, but I can anticipate that having h5py has a hard dep could kill my will to keep the boat afloat.

[^1]: unless they are willing to include building and monkeypatching h5py as part of their testing process, which is probably even harder than it sounds to start with.

edit: forgot the footnote

matthewturk commented 6 months ago

This is a reasonable point to make. The numpy 2.0 transition has brought up a lot of issues, and keeping ahead of them has likely saved a ton of difficulty when it actually comes out, and thank you for staying ahead of them.

There are other projects using h5py as a hard dependency -- do we have any informal info on how they are managing? And, is this something that the "scientific python" project would be interested in looking at? My recollection is that it is a community umbrella for issues like this.

So let's go back to thinking about how we might address h5py not being a hard dependency. If this is the only place we have it, I see two options:

A huge advantage of using h5py for bitmasks is that we are able to store additional metadata without necessarily breaking things. Stuff like the bounding box, for instance, as well as a couple other bits that are quite useful. We would have to give that up if we eliminated h5py completely from the bitmask operation, or we'd have to at least have to re-implement something along those lines within the files.

(It's not obvious to me that we couldn't, though. For instance, SDF files used to mix strings as headers with binary after a delimiter, and based on our IO patterns something might be feasibly done here as well.)

Perhaps to stay ahead of this, we could make serialization of bitmasks optional, and simply not done if h5py isn't installed?

neutrinoceros commented 6 months ago

There are other projects using h5py as a hard dependency -- do we have any informal info on how they are managing?

Speaking for myself, the only lib I maintain that has it as a hard dep has a small enough community of users that it doesn't matter much that I can't fully test it at the moment. From what I can tell, anyone who has h5py as a dependency has to wait for them to catch up. Again, no offense to h5py maintainers, they're having it much worse than we are.

And, is this something that the "scientific python" project would be interested in looking at? My recollection is that it is a community umbrella for issues like this.

as far as I can tell, that's mostly a discussion to be had between Numpy people and h5py people. Numpy devs are actually quite reasonable in the expectations they set for the community and it looks like they're willing to accommodate their release schedule to big players' response, but I haven't seen such a discussion yet (maybe it's happening behind the curtain or I'm just not looking in the right places ?).

Perhaps to stay ahead of this, we could make serialization of bitmasks optional, and simply not done if h5py isn't installed?

For the record this isn't a part of yt that I'm familiar with, so anything I say in this space should be taken with a grain of salt. That said, I don't think it's a problem that some yt APIs would just raise an error if h5py isn't installed (as is currently the case). While not ideal, it's an easily discoverable and fixable issue for users (just install h5py).

In other words, I'm okay with some fraction of the library having a hard requirement on h5py, I just think it shouldn't be 100%.

I realize that I may not be seeing the gains as clearly as the costs, and it's possible that the future is actually much brighter than the recent past, in which case most of my argument doesn't hold, but I'm not willing to take a bet just yet.

(and for the record, I'm sorry I gave this conversation such an unpleasant twist. My hope is to avoid a much more unpleasant discussion down the road.)

matthewturk commented 6 months ago

So the place that this would show up as a big regression for users is in any loading of particle-based datasets in a system without h5py, in which case the index will be reconstructed every time. That may have been a big issue some time ago, when there were lots of non-hdf5 particle datasets, but it's entirely possible it no longer is as many/most have migrated to using hdf5.

jzuhone commented 6 months ago

Let's face it--whichever way we go on this, the reality for probably most of our users is going to be that they are not going to be able to upgrade to NumPy 2.0 unless they install h5py by hand. Because I would imagine that most of them are using frontends that depend on it. As @matthewturk pointed out, most of the Gadget-type frontends use it now, FLASH, Enzo, Athena++, etc.

This is not an argument for requiring it as a dependency, but it is an argument for warning our users about the issue.

neutrinoceros commented 6 months ago

Good point. I'll include it it yt 4.3.1´s release notes !