xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0
162 stars 43 forks source link

Consistency between `__contains__` and `__getitem__` #240

Open djhoese opened 1 year ago

djhoese commented 1 year ago

I'm working on adding conversion to a DataTree object in the Satpy library and while writing some tests I've come across a quality of life change I'm hoping is possible/allowed to add. Basically, I want to be able to check if a node is in a tree in the same way I can access that node using the / syntax. For example:

In [1]: from datatree import DataTree

In [3]: a = DataTree.from_dict({"a/b/c": None})

In [4]: a["a/b"]
Out[4]:
DataTree('b', parent="a")
└── DataTree('c')

In [5]: "a/b" in a
Out[5]: False

I expected (with very limited prior usage/knowledge of DataTree) that the __contains__ method would do the same or similar traversal as the __getitem__ method. In the above case the in line would return True.

Is there a functional reason why this can't be implemented this way? Any issues or gotchas you can foresee if I were to implement it (performance maybe?)?

Edit: Oops forgot to include the tree creation line.

TomNicholas commented 1 year ago

Hi @djhoese - this is a neat idea that seems reasonable to me!

Implementing this for checking for child nodes might be as simple as changing this line to read something like

return key in self.variables or child.path for child in self.descendants

but I think a slightly more complicated iteration would be needed to also return variables within child nodes.

Another gotcha would be ensuring this works for relative paths (i.e. if the node you were searching from in your example was not the root node of the tree).

Probably we want some internal convenience property like dt._list_all_descendents_and_variables. There's some crossover with ipython key completions too, xref https://github.com/xarray-contrib/datatree/pull/217.

Otherwise I don't see why there would be a performance issue, it's the same kind of traversal as asking to list the descendants.

If you are interested in submitting a PR then I will happily review it!

TomNicholas commented 1 year ago

Thinking about it a bit more this is a case where it's not immediately clear whether to make a datatree object behave more like a one-level dict of children and variables, or something more complex that allows access to multiple levels. At the moment I've made this distinction by basically trying to make most methods that would be defined on dict (or at least on a MutableMapping) work similarly on DataTree. Your suggestion would move away from this correspondence, but I think that would be fine from a usability perspective...

djhoese commented 1 year ago

it's not immediately clear whether to make a datatree object behave more like a one-level dict of children and variables, or something more complex that allows access to multiple level

I had a similar thought. I was torn on whether or not I even wanted to file this issue for discussion. I think the thing that sells this idea for me is that __contains__ and __getitem__ should really work together. You often see __contains__ being implemented as a simple try/except KeyError on __getitem__.

The sub-node/child handling of DataTree.__getitem__ is very useful so it is hard to argue that is shouldn't exist. The __setitem__ usage it equally convenient. I think, in my opinion of course, if __getitem__ and __setitem__ both support sub-node handling then __contains__ should too. That said, I don't think it would make sense or be as useful (compared to get/set/contains) for .keys/items/values to include multiple levels. That would maybe be better served by something like a DataTree.traverse() or iterate_tree() or something similar that traverses the entire tree (breadth or depth first, not sure).

I just did a test and it looks like DataTree.get() doesn't support the / syntax. With that in mind and reading the docstring for it whee it talks about nodes versus variables versus coordinates...maybe it would be better to remove the / support in getitem/setitem? It seems odd for .get and __getitem__ to have different features/functionality, right?

This is just brainstorming. I'm not trying to enforce or suggest I know what is best design for DataTree.

TomNicholas commented 1 year ago

This is just brainstorming. I'm not trying to enforce or suggest I know what is best design for DataTree.

No thank you @djhoese I appreciate these questions - this is one of those things that I internally debated about with datatree's design but ultimately just picked something without really having to justify it to anyone :sweat_smile: At the very least it's useful to write out the logic here in case someone else has a similar question later.

The sub-node/child handling of DataTree.getitem is very useful so it is hard to argue that is shouldn't exist.

I agree. It's one of the main things that elevates DataTree beyond literally being just a dict of dicts. It also makes the filesystem analogy work, which IMO makes it much easier for users to understand the whole design.

I also tried hard to ensure that even though __get/setitem__ supports multiple levels then it still behaves the same way that a one-level dict would. That means even if you didn't know that __getitem__ supports multi-level access, the fact it does is unlikely to cause you a problem.

That said, I don't think it would make sense or be as useful (compared to get/set/contains) for .keys/items/values to include multiple levels.

This is pretty much exactly what I decided too. I think when someone calls .items they are thinking of the object as a dict, implying they want a one-level iteration.

That would maybe be better served by something like a DataTree.traverse() or iterate_tree() or something similar that traverses the entire tree (breadth or depth first, not sure).

So far my traversal properties are mostly named things like .descendants, but it's the same idea.

I just did a test and it looks like DataTree.get() doesn't support the / syntax. With that in mind and reading the docstring for it whee it talks about nodes versus variables versus coordinates...maybe it would be better to remove the / support in getitem/setitem? It seems odd for .get and __getitem__ to have different features/functionality, right?

Yes, but we could just as easily argue that .get should support /. Or argue that it's the "square bracket methods" which are special, and so it's fine for .get and .__getitem__ to support different features.

(Also there is precedent for .__getitem__ having features beyond .get - that's what xarray.Dataset already does. Xarray's API reference explicitly lists Dataset's "dictionary interface" but it also allows array indexing with square bracket syntax.)


I think the thing that sells this idea for me is that __contains__ and __getitem__ should really work together. You often see __contains__ being implemented as a simple try/except KeyError on __getitem__.

I think I agree here. You mean like

if path in dt:
    node = dt[path]

I wonder if there is an argument that checking for existence can safely be more "permissive" than the values returned by iteration... i.e. that succeeding in accessing multi-level paths is unlikely to surprise the user compared to returning the entire contents of a tree from .items... This seems unconvincing though :man_shrugging:

djhoese commented 1 year ago

I wonder if there is an argument that checking for existence can safely be more "permissive" than the values returned by iteration... i.e. that succeeding in accessing multi-level paths is unlikely to surprise the user compared to returning the entire contents of a tree from .items... This seems unconvincing though man_shrugging

I'm not sure I understand what you're saying here. I think you're restating your agreement that keys/values/items should not traverse, but getitem/setitem/__contains__ should.

Just as an added example, what I meant when I said contains and getitem should "work together" is that they are complimentary methods in my opinion. In my original case I was writing a test:

tree = to_xarray_datatree(...)
assert "a/b/c" in tree
node = tree["a/b/c"]

Now some might argue that any if key in obj keys should be try/except KeyError, but I've never really liked the ugliness of that. But I think at a higher level we're talking about "is this thing in this tree? let me get that thing" which is broken if "is this thing in this tree? No, OK...why not...wait, yes it is".

TomNicholas commented 4 months ago

Sorry for forgetting to reply to this for almost a year 😅 I was reminded of this by a different discussion and now I think I do agree with you.

I think at a higher level we're talking about "is this thing in this tree? let me get that thing" which is broken if "is this thing in this tree? No, OK...why not...wait, yes it is".

This especially seems like a good reason for __contains__ to support path-like syntax.

I also think that it's convenient that we can separate the "is this key in the local node's variables" question from the "is this path in the tree anywhere" question. In the former case the key will never have a / character and the latter case it will always have one, so we could perhaps think of this like overloading the __contains__ method to behave slightly differently for str vs PathLike types.

djhoese commented 4 months ago

No problem. I'm on paternity leave now so it was lucky I even saw your message (plus I subscribe to every event in this repository so sometimes I miss conversations I'm a part of).

Good points and I think I agree.