yt-project / yt

Main yt repository
http://yt-project.org
Other
461 stars 276 forks source link

Particle IDs of Gadget HDF5-style data are converted from integers to floats #4995

Open arkordt opened 5 days ago

arkordt commented 5 days ago

Bug report

Bug summary

An arepo snapshot whose on-disk ParticleIDs data type is uint32 or uint64 is converted to float32 or float64, respectively.

The conversion can be traced back to two locations. The first conversion is in yt/utilities/io_handler.py:209, the second conversion occurs in unyt/array.py:712 that is called from yt/data_objects/selection_objects/data_selection_objects.py:217. If both conversions are commented out, the expected output is produced.

I also submitted an issue to unyt because I am not sure where it would be better to avoid the type conversion.

Code for reproduction

yt-test-data should be the path where the YT test data is stored.

>>> import yt
>>> ds = yt.load('yt-test-data/ArepoBullet/snapshot_150.hdf5')
>>> ds.all_data()['PartType0', 'ParticleIDs']

Actual outcome

unyt_array([1710352., 1710346., 1710350., ..., 5721056., 5843308., 5843259.],
      dtype=float32, units='(dimensionless)')

Expected outcome

unyt_array([1710352, 1710346, 1710350, ..., 5721056, 5843308, 5843259],
      dtype=uint32, units='(dimensionless)')

Version Information

yt was installed via conda (version 4.3.1) or from source (main branch) for testing and bug tracing purposes.

welcome[bot] commented 5 days ago

Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us to gather relevant information to help diagnosing and fixing the issue.

chrishavlin commented 5 days ago

I commented over in the unyt issue, but I do think that this is more of a yt issue than a unyt issue. IMO it'd make sense to handle integer, dimensionless quantities (like particleIDs) with a special case that avoids the type and unit conversion that is applied to physical fields.

neutrinoceros commented 5 days ago

The issue goes much deeper than this: many Cython routines in yt expect float64 as the dtype. Changing that would be a massive undertaking I think (with the possible exception of adding support for other float types, first and foremost float32). In my honest opinion this isn't a bug unless you go above the larger int that can be represented exactly as a float64 (which I think is 9_007_199_254_740_992). Is this actually creating an issue at runtime or is it only surprising ?

chrishavlin commented 5 days ago

I don't think it'd be too bad to preserve types just for read/selection operations though. Was just playing around with a possible implementation and it seems to work (can push up a draft in a bit) -- any operations beyond read/selection operations have implicit or explicit type conversions on the call to the cython routines.

arkordt commented 4 days ago

I had a look into the draft and at least from a user perspective, this seems like what I was expecting. Maybe I can add a bit more context how/why I encountered this issue and answering

... Is this actually creating an issue at runtime or is it only surprising ?

My simulation contains tracer particles and I was thinking about how to writing routines for/extending yt to work with them. The tracer particles are a separate particle type in the hdf5 file. The group has two datasets ParentID and TracerID. To get the number of tracers within a cell, I would need to count how many multiples of the same ID are in ParentID and then join these information to the gas cells (actually arepo writes NumTracers, an uint32 field, to the gas cells but initially, I somehow missed that). Retrieving the tracer coordinates, one would need to join the gas cell coordinates with the tracers with ParticleIDs==ParentID -- speaking with relational database operations. Currently, tracer particles of this storage type are ignored but this was a rather easy to change.

In this context, the matching operations on floats that represent integers would probably work because the number of digits in the largest particle id is less than the accuracy limit of floats. Additionally, there are no other mathematical operations on the ids, so rounding should not occur and an equality test of two float ids probably works, too. However, this is not guaranteed.

neutrinoceros commented 3 days ago

there are no other mathematical operations on the ids, so rounding should not occur and an equality test of two float ids probably works, too. However, this is not guaranteed.

Indeed it's not.

any operations beyond read/selection operations have implicit or explicit type conversions on the call to the cython routines.

I missed that. Then I agree it makes sense to at least explore a simple patch such as #4996. Let us know when it's ready for review !

chrishavlin commented 22 hours ago

added a bunch of tests and I'm not seeing any issues over in #4996 , just marked it as ready to review