xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0

Consistency between DataTree methods and pathlib.PurePath methods #283

Closed TomNicholas closed 3 weeks ago

TomNicholas commented 10 months ago

@eschalkargans suggested in #281 that the API of DataTree objects could more closely follow that of pathlib.PurePath objects. I think aligning the APIs and nomenclature this way is a good idea. In general I think it's conceptually useful to think of a DataTree object as if it were an instance of pathlib.PurePosixPath (even though the actual implementation should not work like that).
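
To make the analogy concrete, here is a minimal sketch using only the standard library. It does not touch DataTree at all: the node path `/weather/europe/temperature` is a made-up example, chosen only to show which `PurePosixPath` attributes have an obvious tree-node counterpart.

```python
from pathlib import PurePosixPath

# Hypothetical absolute path of a node inside a tree (made up for illustration).
node_path = PurePosixPath("/weather/europe/temperature")

parts = node_path.parts               # the "parsed" path, one entry per level
name = node_path.name                 # name of the node itself
parent = node_path.parent             # path of the immediate parent node
ancestors = list(node_path.parents)   # nearest ancestor first, root last
root = ancestors[-1]                  # the last ancestor is the root

# Building a child path from a parent path, and going back to a relative one.
rebuilt = PurePosixPath("/weather").joinpath("europe", "temperature")
relative = node_path.relative_to("/weather")

print(parts, name, parent, root, rebuilt, relative, sep="\n")
```

Whether the corresponding DataTree attributes should behave exactly like this is what the rest of this issue is about.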

There are various methods we might want to add/change to make them more compatible:

Inspired by pathlib.PurePath:

Inspired by pathlib.Path (i.e. concrete paths):

Several of these might be useful abstractions internally, especially .joinpath, .walk, and .replace.
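
As a rough illustration of how `.joinpath` and `.walk` could combine as internal abstractions, here is a sketch of a top-down walk over a toy tree. The nested-dict tree and the `walk` function are assumptions made up for this example; they are not the DataTree data model or an existing DataTree method.

```python
from pathlib import PurePosixPath

def walk(tree, path=PurePosixPath("/")):
    """Yield (path, child_names) top-down, loosely modelled on os.walk.

    `tree` is a toy stand-in: nested dicts whose keys are child node names.
    """
    child_names = sorted(tree)
    yield path, child_names
    for child_name in child_names:
        # joinpath builds the child's path from the parent's path.
        yield from walk(tree[child_name], path.joinpath(child_name))

# Made-up tree for illustration only.
toy = {"weather": {"europe": {}, "asia": {}}, "ocean": {}}
visited = [str(path) for path, _ in walk(toy)]
print(visited)
```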

EDIT: Let's also document this similarity:

etienneschalk commented 10 months ago

Hi @TomNicholas , I would like to help with the code on this one. Do you think this might be a good first issue? Thanks!

TomNicholas commented 10 months ago

Sure @etienneschalk! I think each of these bullet points is really its own little issue, so feel free to open a PR for any one of them. (Maybe leave the tree-walking related ones for now though because I think those will be a little more complicated.)

TomNicholas commented 10 months ago

Once we have completed some of these it would also be nice to add a little section in the documentation that points out this similarity explicitly to users. Also we can then reorganise the grouping of methods in api.rst to have a section for Path-like methods.

etienneschalk commented 7 months ago

Pathlib

The following are some notes I took while reading the pathlib documentation, thinking about equivalences in DataTree usage.

Listing

Curated list

This list only contains methods I did not classify as "Irrelevant". The "Irrelevant" tag is subjective and reflects my own understanding; I may have missed important methods.

Pure Paths

Concrete Paths

Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.
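
To give an idea of what such a companion class could look like, here is a sketch of a path object bound to a tree. Everything in it is hypothetical: the class name `ToyTreePath`, its methods, and the nested-dict toy tree (dicts stand in for groups, any other value for a leaf) are assumptions for illustration, not a proposed implementation.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass
class ToyTreePath:
    """Hypothetical companion object: an absolute pure path bound to a toy tree."""
    tree: dict            # nested dicts = groups, any other value = leaf
    path: PurePosixPath   # assumed to be absolute

    def _lookup(self):
        node = self.tree
        for part in self.path.parts[1:]:   # skip the leading "/"
            if not isinstance(node, dict) or part not in node:
                raise KeyError(str(self.path))
            node = node[part]
        return node

    def exists(self) -> bool:
        """True if the path is contained in the bound tree."""
        try:
            self._lookup()
        except KeyError:
            return False
        return True

    def is_group(self) -> bool:
        """Rough analogue of Path.is_dir()."""
        return self.exists() and isinstance(self._lookup(), dict)

    def iterdir(self):
        """Yield the direct children of a group, like `ls`."""
        node = self._lookup()
        if not isinstance(node, dict):
            raise NotADirectoryError(str(self.path))
        for child_name in node:
            yield ToyTreePath(self.tree, self.path / child_name)

# Made-up tree for illustration only.
toy = {"weather": {"temperature": [1.0, 2.0], "europe": {}}}
weather = ToyTreePath(toy, PurePosixPath("/weather"))
children = sorted(str(child.path) for child in weather.iterdir())
print(children)
```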

Full list

#### Pure Paths

- `PurePath.parts`
  - "parsed" path
- `PurePath.drive` **Irrelevant**
  - Irrelevant for the `PurePosixPath` implementation of `PurePath`
- `PurePath.root`
  - Relevant to differentiate between absolute and relative paths. This is already done by `PurePath.is_absolute()`
  - For `DataTree.root`, same comment as `parents`
  - Note: `root` = `parents[-1]`? No, currently the parents are rewound until finding a parent with `root is None`. Could it be simplified with `parents[-1]`, if the path hierarchy is already known in advance?
- `PurePath.anchor` **Irrelevant**
  - drive + root = same as root for `PurePosixPath` = irrelevant
- `PurePath.parents`
  - `DataTree.parents` should use the paths obtained via its `NodePath` identifier inside of the root's `DataTree` to produce the list of parent `DataTree` objects.
  - Note: this means all nodes must be aware of the root, which is the case via the `root` attribute. Trees are aware of being a root or a subtree.
- `PurePath.parent`
  - Same comment as `parents`
  - Note: `parent == parents[0]`?
- `PurePath.name`
  - Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
- `PurePath.suffix` **Irrelevant**
- `PurePath.suffixes` **Irrelevant**
- `PurePath.stem` **Irrelevant**
- `PurePath.as_posix()` **Irrelevant**
- `PurePath.as_uri()` **Irrelevant**
- `PurePath.is_absolute()`
  - Interesting, as node IDs should be absolute.
- `PurePath.is_relative_to(other)`
  - Can be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
- `PurePath.is_reserved()` **Irrelevant**
- `PurePath.joinpath` **Irrelevant** for end user
  - Cannot see the immediate utility for an end user, might be useful internally
- `PurePath.match`
  - This is a "single-element" version of glob, only checking if a single path conforms to the pattern
  - Might be useful to implement `DataTree.glob` by mapping it against all paths contained in the tree.
- `PurePath.relative_to(other, walk_up=False)`
  - Might be useful to detach a node from a tree, to generate its new path identifiers.
- `PurePath.with_name(name)`
  - Might be useful to rename a node and update the path representing it inside of its root DataTree.
- `PurePath.with_stem(stem)` **Irrelevant**
  - Irrelevant (same reason as `stem`, there is no concept of extension in DataTree paths)
- `PurePath.with_suffix` **Irrelevant** for the same reason
- `PurePath.with_segments(*pathsegments)`
  - Can be useful because the docs say it can be used with classes deriving from `PurePath` (e.g. `PurePosixPath`), like `NodePath`

#### Concrete Paths

Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.

- `Path.cwd()` **Irrelevant**
- `Path.home()` **Irrelevant**
- `Path.stat()` **Irrelevant**
- `Path.chmod()` **Irrelevant**
- `Path.exists()` **Irrelevant**
  - Can be used to determine if the path is contained in the bound instance of DataTree
- `Path.expanduser()` **Irrelevant**
- `Path.glob()`
  - Can be used to map `PurePath.match` against all paths contained by the bound instance of `DataTree`
  - Regarding `case_sensitive`, since DataTree works with `PurePosixPath`, keep the default POSIX config: `True`
- `Path.group()` **Irrelevant**
- `Path.is_dir()`
  - It might be useful to discriminate between `DataTree` and `Dataset` (directory-like) and `DataArray` (file-like)
  - Maybe a better name like `is_group` could help, or `is_aggregation`
  - Note: `Dataset` may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
- `Path.is_file()`
  - Mirrors `Path.is_dir()`
  - Maybe a better name like `is_dataarray` could help, or `is_leaf`
- `Path.is_junction()` **Irrelevant**
- `Path.is_mount()` **Irrelevant**
- `Path.is_symlink()`
  - To be considered if symbolic nodes are to be implemented
- `Path.is_socket()` **Irrelevant**
- `Path.is_fifo()` **Irrelevant**
- `Path.is_block_device()` **Irrelevant**
- `Path.is_char_device()` **Irrelevant**
- `Path.iterdir()`
  - Like `ls`
- `Path.walk`
  - A good candidate method to implement to explore a `DataTree`
  - Introduced in Python 3.12 only
  - From a developer's point of view, I currently use `Path.rglob("*")` when I need to iterate through a directory, so maybe `walk` is dispensable.
- `Path.lchmod` **Irrelevant**
- `Path.lstat` **Irrelevant**
- `Path.mkdir`
  - Probably irrelevant, but kwargs like `parents=True`, `exist_ok` might be useful when working with groups.
- `Path.open` **Irrelevant**
- `Path.owner` **Irrelevant**
- `Path.read_bytes` **Irrelevant**
- `Path.read_text` **Irrelevant**
- `Path.readlink` **Irrelevant**
- `Path.rename`
  - Might be useful to rename a node inside of the root tree
- `Path.replace`
  - Similar to `Path.rename` for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. `replace` is more "expeditive" than `rename`: if the target path already exists, it is unconditionally replaced.
- `Path.absolute()`
  - Can be useful for browsing the DataTree
- `Path.resolve()`
  - Similar to `absolute`, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
- `Path.rglob`
  - Similar to `Path.glob`, with the `**` prefix. Depends on the developer's taste
- `Path.rmdir`
  - To remove an entire subtree from the tree? Might be useful in conjunction with `relative_to`
- `Path.samefile`
  - I cannot see a use for it right now
- `Path.symlink_to`
  - To be considered if symbolic links are to be implemented in DataTree
- `Path.hardlink_to` **Irrelevant**?
- `Path.touch`
  - Create an empty DataArray at that location?
- `Path.unlink`
  - The naming might be confusing when working with `DataTree`.
- `Path.write_bytes` **Irrelevant**
- `Path.write_text` **Irrelevant**
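
A few of the notes above can be tried out with plain `PurePosixPath` objects. The sketch below shows a glob built by mapping `PurePath.match` over all paths, new identifiers for a detached subtree via `relative_to`, and a rename via `with_name`. The node paths and the helper `glob_nodes` are made up for illustration and are not part of DataTree; `is_relative_to` requires Python 3.9 or later.

```python
from pathlib import PurePosixPath

# Made-up absolute node paths, standing in for all paths contained in a tree.
node_paths = [
    PurePosixPath(p)
    for p in (
        "/weather",
        "/weather/europe",
        "/weather/europe/temperature",
        "/weather/asia",
        "/weather/asia/temperature",
        "/ocean",
    )
]

def glob_nodes(paths, pattern):
    """Hypothetical glob: keep the paths for which PurePath.match succeeds."""
    return [p for p in paths if p.match(pattern)]

# A relative pattern is matched from the right-hand side of each path.
temperature_nodes = glob_nodes(node_paths, "*/temperature")

# "Detaching" /weather/europe: recompute the identifiers of its nodes
# relative to the new root.
subtree_root = PurePosixPath("/weather/europe")
detached = {
    p: PurePosixPath("/") / p.relative_to(subtree_root)
    for p in node_paths
    if p.is_relative_to(subtree_root)
}

# Renaming a node only changes the last component of its path.
renamed = subtree_root.with_name("eu")

print(temperature_nodes, detached, renamed, sep="\n")
```

Note that in this sketch a rename would also require updating the paths of every descendant of the renamed node, which `with_name` alone does not do.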

Ideas

Ideas for questions for a FAQ. A FAQ is a powerful documentation format; it is used, for instance, in the ruff documentation: https://docs.astral.sh/ruff/faq/ The idea is to answer, as quickly as possible, the questions that seem mundane to someone who knows the tool but are not at all obvious to someone starting to use it.

See https://github.com/pydata/xarray/blob/fffb03c8abf5d68667a80cedecf6112ab32472e7/xarray/datatree_/datatree/datatree.py#L425

@property
def parent(self: DataTree) -> DataTree | None:

TomNicholas commented 3 weeks ago

Closing in favour of https://github.com/pydata/xarray/issues/9448 upstream.