Generalize categorizes (keep state, etc.)

zerothi / sisl

Electronic structure Python package for post analysis and large scale tight-binding DFT/NEGF calculations

https://zerothi.github.io/sisl

Mozilla Public License 2.0

201 stars 60 forks source link

Generalize categorizes (keep state, etc.) #687

Open zerothi opened 8 months ago

zerothi commented 8 months ago

Describe the feature

Currently the categories are quite strict in the sense that they provide the functionality to check individual atoms.

However, there are cases where it could be valuable to retain state information in the form of

geometry from which it was categorized
index in the above geometry
neighbor information (when searching for neighbors)

A primary concern when trying to solve #202 is to reduce the overhead of doing the neighbor estimation twice, which is necessary when doing the categories.

Instead we might propose something like this:

a category is a base class that is only used to check stuff
when categorizing a Category, it should return a CategoryResult class specifying whether it succeeded (if res) and some information. I imagine something like:
```
 @dataclass
 class CategoryResult:
     category: Category
     info: PropertyDict # geometry, atom, [neighbors, ...]
     def __bool__(self): return not isinstance(self.category, NullCategory)
```
This approach is not too different from scipy optimizers. My main worry is the optional variables associated with the categories.
Are there other ways to do this?

zerothi commented 7 months ago

Here it would also be nice if one could count the number of categories a single atom is existing in. E.g. when extracting the rings (from #688), it might be useful to extract non-overlapping rings, in which case the initial categorization would return overlapping rings. Then one would need to figure out a single atom which is only in 1 ring. And then remove duplicates.

zerothi commented 7 months ago

Here are some more thoughts on how to generalize the categories. Currently the categories are looped on individual atoms, and then checked.

However,

When we want to deal with categories involving multiple atoms, we have to know the full geometry before making a categorization.
When doing composite categories with categories involving multiple atoms (say rings), then it might be valuable to also restrict the ring to be a combination of another set of categories, i.e. 5 C and 1 N atom. This could possibly be done by having a category which fixes a preset number of atoms with another category, since there are no logical symbols involving *, we could overload the Category.__mul__ to accept a category and an integer? This could be errorprone as order of operations would matter
```
# this would result in the (Ring & Z) * 5, which would never be true.
cat = CategoryRing(...) & CategoryZ(Z=6) * 5
```
Exactly how this would work I am not sure, it should be simple, and intuitive, ideas are welcome!
Categories should act on the complete geometry. So the interface would stay the same, yet the way it works would be changed. My idea would be that a categorization would always return a CategoryResult object, which stores the category it originates from, and a list of positive results. E.g. for a AtomZ it would return something like [[1], [2]]. The point of this would be that categories involving mulitple atoms could have a result index like [[1, 2, 3, 4], [7, 8, 9]]. The result object could then also hold additional information that is defined by the originating Category, this is partly inspired by the scipy, OptimizeResult which can be very versatile, but a bit hard to document. On the other hand, it might be simple enough for end-users since the additional information isn't always required knowledge (you just want the result arrays). This CategoryResult object would then have an __iter__ implementation that iterates the result index (as shown above). This should also do the equivalent for CompositeCategories, the result should encompass a result from the combined result of both categories. This might seem a bit dubious, however, I think it could be useful for the composites to contain the information on the nested categories. Consider for example the case where one wishes to extract the atoms only having 2 neigbors, and they must only be carbons, here it would be ideal if one could access a result from the neighbor result (it might contain the information on which atoms are its neighbors etc.) So we need a mechanism that lets one filter the result easily, since we can't push down the stored information from nested categories into a composite result (consider if there are two branches of categories, both with neighbor conditions, that would mean that there would be two (possibly different) additional information.
Hence it might be ideal to do:
```
neighbor = Neighbors(...)
cat = neighbor & AtomZ(6)
results = cat.categorize(geometry)
for result in results.select(neighbor):
    ...
```
this would only loop the positives of the composite category cat, but filter out the result array from the neighbor category (if the extra information is needed). If one is simply interested in the indices, one would simply do for result in results.
When dealing with composites we should, by default, force the categories to be fully expanded. I.e. each category in the Compostite should be parsed as though it didn't know of the other one. Hence one could do (same as above):
```
for result in results.B: # if we should name it `A`/`B` (as is done now), or `lhs`/`rhs`.
```
I am not sure if results.B would be too ambiguous with results.select(AtomZ(6))? Should it even be possible? I assume it could be useful to check for near misses with this functionality open, and it shouldn't contradict the methodology as they are separated.
Oh, and all of the above would deprecate the NullCategory which is actually kind of weird to deal with... So that would make things simpler.

Hmm, I might come up with some more, but here some discussion points, @pfebrer @ialcon comments?

ialcon commented 7 months ago

Hi @zerothi, I like very much the overall approach, and to me it looks quite logical and organized. I also like both the for result in results.select(neighbor): and for result in results.B: construction. I understand that "A" in your example would be the first category indicated in the composite creation, and "B" the second, right?

results.A --> neighbor
results.B --> AtomZ(6)

In this example what it is not fully clear to me is to what AtomZ(6) applies? To the targeted ia whose bonds are scrutinized (does ia have 2 bonds?) or to its neighbors? So, in other words, cat in cat = neighbor & AtomZ(6) would provide all carbon atoms having (say) 2 neighbors or it would provide all atoms having 2 carbon neighbors? I guess it is the first... right? There would be an easy way to do the second using a CompositeCategory?

Finally, returning to the iteration of results, which type of info would provide something like for result in results.B? In the case of neighbors, it would return the number of neighbors for each result, the indices of the neighbors for each result? Could this Category-specific results have attributes? Say:

for result in results.B :
    print(result.n_neighbors) # number of neighbors
    print(result.neighbors)     # indices of neighbors
    ...

And, super finally, wouldn't make more sense than this Category-specific information was obtained per result instead from the list of results ... :

for result in results :
    print(result.B.n_neighbors) # number of neighbors
    print(result.B.neighbors)     # indices of neighbors
    print(result.B.distances)     # distance to each neighbor
    ...

Of course I guess this also depends on how simple/complex you want each Category to be..

pfebrer commented 7 months ago

Agree

2/3. I don't understand the results you are proposing. Why do we need a 2D array of results? To me the result of categorizing should be an atom-wise property. I.e. an array of length na containing the category of each atom. I think anything else is overcomplicating the implementation and the usage. Perhaps we are trying to fit into the concept something that simply can't fit (ring identification).

Again I think this might be overcomplicating something as simple as categorization. Here I see the point that you are making, but do you have an example where it could be useful?
Agree, it is better that it disappears :)

I think a clean approach would be to first define atom-wise properties, maybe under an AtomProperty class: some properties could be:

Z
NNeighbors
x, y, z
... basically all properties that are checked in the current categories and maybe some more.

These properties, when combined with a geometry, should return an array containing the property value for each atom.

Then, from them, using comparison operators, one could create a BooleanCategory (a subclass of AtomProperty maybe?), that could be used as a mask to select atoms for example:

NNeighbors == 2: Boolean mask specifying if the atom has two neighbors or a different number.
NNeighbors > 2: Boolean mask specifying if the atom has more than two neighbors.

These categories would be combined into composite categories as the current categories are now.

Then one could also use these properties to create IntCategory or StrCategory. Although here I don't see the need for special classes, they could be simple AtomProperty whose property is an integer or a string. But the point I want to make is that these would be used for classifying, rather than creating masks:

NNeighbors is already a "classifier" since it returns integers.
Clustered([Z, NNeighbors], some_clustering_algorithm): Here Clustered would take some other property (or multiple of them) and acording to some algorithm group atoms into clusters. Each atom would have an integer assigned which is the cluster index.

The idea is that these atom properties I think could be much more versatile and clean, as they would be defined as atom-wise arrays of data. So their shape is more predictable. And they can be reused for other things. E.g. to color or scale atoms in plots :smile:

I would advocate for their returns being simple arrays, but maybe CategoryResult could be a wrapper that makes it look and work like an array while containing extra metadata.

I think the ring identification could be better off in a separate functionality unless we can frame it as an atom-wise problem. Otherwise it makes the concept of categories much more complicated.

zerothi commented 7 months ago

Hi @zerothi, I like very much the overall approach, and to me it looks quite logical and organized. I also like both the for result in results.select(neighbor): and for result in results.B: construction. I understand that "A" in your example would be the first category indicated in the composite creation, and "B" the second, right?
results.A --> neighbor
results.B --> AtomZ(6)
Indeed, that was the idea.

In this example what it is not fully clear to me is to what AtomZ(6) applies? To the targeted ia whose bonds are scrutinized (does ia have 2 bonds?) or to its neighbors? So, in other words, cat in cat = neighbor & AtomZ(6) would provide all carbon atoms having (say) 2 neighbors or it would provide all atoms having 2 carbon neighbors? I guess it is the first... right? There would be an easy way to do the second using a CompositeCategory? The first, yes, AtomZ selects atoms with a given Z. Then the neighbors are determined on the smaller section.

Finally, returning to the iteration of results, which type of info would provide something like for result in results.B? In the case of neighbors, it would return the number of neighbors for each result, the indices of the neighbors for each result? Could this Category-specific results have attributes? Say:
for result in results.B :
    print(result.n_neighbors) # number of neighbors
    print(result.neighbors)     # indices of neighbors
    ...
Yeah, here I am a bit more undecided. Since you are already in the AtomZ category, you can't access the neighbors. That is a bit difficult on the composite categories.

And, super finally, wouldn't make more sense than this Category-specific information was obtained per result instead from the list of results ... :
for result in results :
    print(result.B.n_neighbors) # number of neighbors
    print(result.B.neighbors)     # indices of neighbors
    print(result.B.distances)     # distance to each neighbor
    ...
Of course I guess this also depends on how simple/complex you want each Category to be.. Agreed! This is a good suggestion, lets continue along that line.

Agree

2/3. I don't understand the results you are proposing. Why do we need a 2D array of results? To me the result of categorizing should be an atom-wise property. I.e. an array of length na containing the category of each atom. I think anything else is overcomplicating the implementation and the usage. Perhaps we are trying to fit into the concept something that simply can't fit (ring identification).

The main problem of the way things currently work is exactly that it is a category per atom. If the category is per atom, then it requires additional checks to describe which atoms belongs to a ring or not? No? Perhaps the 2D lists would only be interesting for those categories which deals with multiple atoms (such as the ring). But for AtomZ it could just as well be a 1D array.

4. Again I think this might be overcomplicating something as simple as categorization. Here I see the point that you are making, but do you have an example where it could be useful?

Another motivation of mine, is the re, it works quite similarly. You query some quantities, and loop the results. A main use would be the rings, where looping the results is equivalent to looping the rings.

EDIT, sorry didn't do an actual example.

My motivation for this would be that one wanted to search for neighbors, but only with AtomZ, one could then check whether it was the lhs or the rhs which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!

5. Agree, it is better that it disappears :)
I think a clean approach would be to first define atom-wise properties, maybe under an AtomProperty class: some properties could be:
* `Z`

* `NNeighbors`

* `x`, `y`, `z`

* ... basically all properties that are checked in the current categories and maybe some more.

The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.

These properties, when combined with a geometry, should return an array containing the property value for each atom.

Then, from them, using comparison operators, one could create a BooleanCategory (a subclass of AtomProperty maybe?), that could be used as a mask to select atoms for example:
* `NNeighbors == 2`: Boolean mask specifying if the atom has two neighbors or a different number.

* `NNeighbors > 2`: Boolean mask specifying if the atom has more than two neighbors.
These categories would be combined into composite categories as the current categories are now.

Then one could also use these properties to create IntCategory or StrCategory. Although here I don't see the need for special classes, they could be simple AtomProperty whose property is an integer or a string. But the point I want to make is that these would be used for classifying, rather than creating masks:
* `NNeighbors` is already a "classifier" since it returns integers.

* `Clustered([Z, NNeighbors], some_clustering_algorithm)`: Here `Clustered` would take some other property (or multiple of them) and acording to some algorithm group atoms into clusters. Each atom would have an integer assigned which is the cluster index.
I think that this would just be difficult to extend, whereas my approach seems a bit more generic, and would allow one to search also based on other criteria. For instance, a ring with 5 Carbon and 1 Nitrogen.

The idea is that these atom properties I think could be much more versatile and clean, as they would be defined as atom-wise arrays of data. So their shape is more predictable. And they can be reused for other things. E.g. to color or scale atoms in plots 😄

I would advocate for their returns being simple arrays, but maybe CategoryResult could be a wrapper that makes it look and work like an array while containing extra metadata. If wanting 1D arrays, it should be trivially handled, no? indices = np.concatenate([result.index for result in results])

I think the ring identification could be better off in a separate functionality unless we can frame it as an atom-wise problem. Otherwise it makes the concept of categories much more complicated.

That was why I suggested to make it based on the full geometry, and not necessarily an Atom property.

My main worry is that limiting its functionality basically makes it useless. Using a AtomZ category is just geometry.atoms.Z == 6, but nesting the logical operators becomes a head-ache, and here the categories would be useful.

@pfebrer see my edit!

pfebrer commented 7 months ago

The main problem of the way things currently work is exactly that it is a category per atom.

Isn't the problem that the category is computed atom by atom? I mean if you give the whole geometry to the property calculator and then it calculates an atom-wise property, that is not a limiting factor, no?

My motivation for this would be that one wanted to search for neighbors, but only with AtomZ, one could then check whether it was the lhs or the rhs which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!

I understand this, but maybe a cleaner solution would be to cache the results of categorization / property calculation, so that a second query is basically for free?

The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.

No, why? You can get at least to the same level of complexity. E.g. imagine that the NNeighbors property accepts a neighbor argument to filter. It would be exactly like the current neighbors category:

NNeighbor(R=3) # Gives you the number of neighbors up to a certain R.
NNeighbor(R=3, neighbor=Z > 1) # Gives you the number of neighbors without counting hydrogens.

The only difference is that you specify the properties, not the categories, and then you create "categories" by using the comparison operators. I'd say that if something you have more freedom to get as complex as you want easily. You can for example create "categories" by comparing two properties:

x > y # This would be a mask that selects atoms whose x coordinate is bigger than the y coordinate.

And actually I have thought of something that could work quite good with the ring finder. The returned property could be of shape (na, n_rings). Something like:

[
 [0, 0, 0], # Atom 0 belongs to [ring_0, ring_1, ring_2] ? 
 [0, 1, 1], # Atom 1 belongs to [ring_0, ring_1, ring_2] ? 
 [1, 0, 0], # Atom 2 belongs to [ring_0, ring_1, ring_2] ? 
 ...
]

Which could be a sparse matrix, of course. But the point is that this would still be an atom-wise property. And doing things like:

rings_property = RingFinder()
rings_count = np.sum(rings_property, axis=-1)

Would give you a property that computes the number of rings to which each atom belongs. Again you would then be free to create a category however complex you want using this property.

You could do many other cool things. E.g.:

# Property that returns a (na, nZ) array specifying how many neighbors of each type an atom has.
neighs = NNeighbors(dim1=Z) 

# Category that selects atoms that have 1 hydrogen neighbor and 2 carbon neighbors.
my_complex_category = neighs[:, 1] == 1 & neighs[:, 6] == 2

Wouldn't it be nice? :smiling_face_with_tear:

pfebrer commented 7 months ago

By the way all this lazy arithmetic (and the caching) could be supported by the nodes framework.

ialcon commented 7 months ago

I kind of also understand the point of @pfebrer and, somehow, agree with it. At the same time, I also understand that, as you say @zerothi, limiting to atom-wise categorization limits a lot what one could do (specially in terms of complexity and versatility).

Would it be possible to have Categories for atom-wise properties (that is, as @pfebrer advocates) and something like Classifiers (or a similar name, maybe Fragments ...) that classify atoms based on a particular property? In such a way Categories could be maintained clean and simple, whereas Classifiers could go crazier, without affecting the fundamental structure of Categories. Of course, Classifiers could internally use Categories to achieve their classification, but Categories in turn would not care at all about Classifiers and not being affected by them at all.

So, following the initial example by @zerothi, instead of doing:

cat = CategoryRing(...) & CategoryZ(Z=6) * 5

You could have:

get_rings = Ring([(AtomZ(6), 5), (AtomZ(7), 1)])
rings = Ring.classify(geometry)
for ring in rings :
    print(ring.na)
    print(ring.id_ring)
    print(ring.idx)
    print(ring.dihedral)
    ....

I am not sure if this makes sense to you, but to me is sort of trying to have some sort of hierarchy, so that we can keep separate the complexity of more versatile objects from the simplier atom-wise categories.

zerothi commented 7 months ago

The main problem of the way things currently work is exactly that it is a category per atom.

Isn't the problem that the category is computed atom by atom? I mean if you give the whole geometry to the property calculator and then it calculates an atom-wise property, that is not a limiting factor, no?

Yes, the problem is that it is an atom-wise property. That is exactly what I am trying to propose to go about. A Category should be able to encompass any number of atoms etc. It might be overkill for AtomZ to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)

My motivation for this would be that one wanted to search for neighbors, but only with AtomZ, one could then check whether it was the lhs or the rhs which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!

I understand this, but maybe a cleaner solution would be to cache the results of categorization / property calculation, so that a second query is basically for free?

Yeah, this might be overdoing it, perhaps we shouldn't worry here. ;)

The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.

No, why? You can get at least to the same level of complexity. E.g. imagine that the NNeighbors property accepts a neighbor argument to filter. It would be exactly like the current neighbors category:
NNeighbor(R=3) # Gives you the number of neighbors up to a certain R.
NNeighbor(R=3, neighbor=Z > 1) # Gives you the number of neighbors without counting hydrogens.

But what is the difference between this and a Category, aren't they the same? Do you mean that categories are something that extracts properties, or? I don't see a big difference between the two.

The only difference is that you specify the properties, not the categories, and then you create "categories" by using the comparison operators. I'd say that if something you have more freedom to get as complex as you want easily. You can for example create "categories" by comparing two properties:
x > y # This would be a mask that selects atoms whose x coordinate is bigger than the y coordinate.
I guess categories could do the same, no ?

And actually I have thought of something that could work quite good with the ring finder. The returned property could be of shape (na, n_rings). Something like:
[
 [0, 0, 0], # Atom 0 belongs to [ring_0, ring_1, ring_2] ? 
 [0, 1, 1], # Atom 1 belongs to [ring_0, ring_1, ring_2] ? 
 [1, 0, 0], # Atom 2 belongs to [ring_0, ring_1, ring_2] ? 
 ...
]
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like pandas.groupby?

Which could be a sparse matrix, of course. But the point is that this would still be an atom-wise property. And doing things like:
rings_property = RingFinder()
rings_count = np.sum(rings_property, axis=-1)
Would give you a property that computes the number of rings to which each atom belongs. Again you would then be free to create a category however complex you want using this property.

You could do many other cool things. E.g.:
# Property that returns a (na, nZ) array specifying how many neighbors of each type an atom has.
neighs = NNeighbors(dim1=Z) 

# Category that selects atoms that have 1 hydrogen neighbor and 2 carbon neighbors.
my_complex_category = neighs[:, 1] == 1 & neighs[:, 6] == 2
Wouldn't it be nice? 🥲 I see, I agree that it could be nice if the methods are generic in some way. Hmm, this needs some thinking. I kind of don't like the way one should post-process things, it seems to be cumbersome when wanting to do more complicated things? By specifying constraints to what you request, you have less flexibility at query time, but simpler parsing.

I kind of also understand the point of @pfebrer and, somehow, agree with it. At the same time, I also understand that, as you say @zerothi, limiting to atom-wise categorization limits a lot what one could do (specially in terms of complexity and versatility).

Would it be possible to have Categories for atom-wise properties (that is, as @pfebrer advocates) and something like Classifiers (or a similar name, maybe Fragments ...) that classify atoms based on a particular property? In such a way Categories could be maintained clean and simple, whereas Classifiers could go crazier, without affecting the fundamental structure of Categories. Of course, Classifiers could internally use Categories to achieve their classification, but Categories in turn would not care at all about Classifiers and not being affected by them at all. If we can get all functionality in 1 class, then I would really try and maintain it like that, I don't know if that will work, if not we might have to have something like your Fragments idea.

So, following the initial example by @zerothi, instead of doing:
cat = CategoryRing(...) & CategoryZ(Z=6) * 5
You could have:
get_rings = Ring([(AtomZ(6), 5), (AtomZ(7), 1)])
rings = Ring.classify(geometry)
for ring in rings :
    print(ring.na)
    print(ring.id_ring)
    print(ring.idx)
    print(ring.dihedral)
    ....
I am not sure if this makes sense to you, but to me is sort of trying to have some sort of hierarchy, so that we can keep separate the complexity of more versatile objects from the simplier atom-wise categories.

I somewhat agree, i just always find API's that have some kind of flexibility very nice to work with, it takes a bit of time to work it out. But...

That is why this discussion is important, what can be done, and how would it work ;)

pfebrer commented 7 months ago

Yes, the problem is that it is an atom-wise property. That is exactly what I am trying to propose to go about. A Category should be able to encompass any number of atoms etc. It might be overkill for AtomZ to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)

The way I see it, in the rings example you are creating multiple categories on the fly, one for each ring that you find: belongs_to_ring_0: (true, false), belongs_to_ring_1: (true, false), ... And then you are defining the value of each atom on that category. So effectively you are building a 2D boolean array like the one I propose. Then you choose to represent this sparse matrix as a list of lists storing the indices of the atoms that are True, because you think it is the most convenient way. But I believe if you think of it as an atom-wise property with an extra dimension (i_ring) it is very general because then you can reduce that result into whatever you want with common matrix operations.

# This would give you the number of rings each atom belongs to
np.sum(rings_property, axis=-1)

# This would give you some measure of "connectivity" (through rings) between atoms
rings_property @ rings_property.T

# This would give you the atoms that belong to ring 2
np.where(rings_property[:, 2])

# Etc...

This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like pandas.groupby?

The convenience of the shape of the result depends on what you want to achieve afterwards. It is not obvious to me that with the one that you propose it is simpler to achieve @ialcon's original goal (separating the two ribbons of an NPG).

I guess categories could do the same, no ?

Yes, but the point is that if you allow the user to build their own conditions you don't have to implement all the possible conditions as categories.

But what is the difference between this and a Category, aren't they the same? Do you mean that categories are something that extracts properties, or? I don't see a big difference between the two.

The main point that I want to make is that a category (as it is now) is the combination of a property + a condition. Internally inside categories we are implementing complex properties like number of neighbors, but we are not exposing that for users to use however they want. Also, we are choosing which conditions to implement, which limits the functionality.

Another thing is that what is now called Category is only a boolean category. But categorical variables are also typically integers or strings. An atom property could be of any type. And of course depending on what type it is, that property can be used for some things or for others.

Regarding the points of @ialcon, atom-wise properties could also have metadata, so I would say this is not an argument in favor of categories vs properties.

ialcon commented 7 months ago

I think that this discussion is, somehow, touching on a deeper discussion about what should be the limits of sisl regarding structure manipulation. Somehow I agree with @pfebrer that trying to implement more and more complex Categories might limit the usage of end users, partly because choosing which complex Categories are implemented is arbitrary, and might not fit what a particular user is looking for. On the contrary, if Categories are kept relatively simple, that could be a good general basis for most of users to exploit those Categories to do crazy things specific to their research problem. In this regard, I feel keeping Categories as atom-wise properties is a good approach. As @pfebrer highlighted, one still may accommodate a high degree of complexity in that "atom-wise" format, and that should well cover the level of structural complexity that sisl (at least to start with) should cover. Of course if, with time, it became clearer that a higher level of complexity was required, we could discuss again if the atom-wise type of Categories may accommodate such new level of complexity or if, contrary, a new type of non-atom-wise Categories (or Classifiers or Fragments or whatever) should be specifically build for that. That would not affect the more basic (and of general use) atom-wise Categories and, at that future point, it would also be useful to know if users are actually using the atom-wise Categories (via some sort of queries for users or something: e.g. "do you use Categories in your daily work?`) to have a better guess about whether more complex functionality could be useful.

So, summing up, I kind of agree now that the approach of @pfebrer might be the best to go as of now. We know that these atom-wise Categories will be used and, additionally, I guess they are simplier to implement, which should keep the development effort as minimal as possible - focusing on those Categories that are more obviously necessary, and building from there. Regarding Rings, I think that an atom-wise Category could be useful - after all, a Rings categorization is something sufficiently fundamental that could have general usage, I think..

pfebrer commented 7 months ago

I'm having a hard time understanding what you mean by categories that are not atom wise. What does "categorizing" mean for you? For me it means assigning a category to each entity. In the case of the geometry the most obvious entity is the atom.

I understand that you might create higher level entities like fragments, and then you might categorize those fragments. But the algorithm that creates those fragments is a grouping operation and in my view it can be understood as categorizing atoms on the category: Does atom "i" belong to fragment "j"?. You either think of it as this or simply as a grouping operation, but I don't see it as "categorizing multiple atoms". I don't understand what that means :sweat_smile:

zerothi commented 7 months ago

Yes, the problem is that it is an atom-wise property. That is exactly what I am trying to propose to go about. A Category should be able to encompass any number of atoms etc. It might be overkill for AtomZ to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)

The way I see it, in the rings example you are creating multiple categories on the fly, one for each ring that you find: belongs_to_ring_0: (true, false), belongs_to_ring_1: (true, false), ... And then you are defining the value of each atom on that category. So effectively you are building a 2D boolean array like the one I propose. Then you choose to represent this sparse matrix as a list of lists storing the indices of the atoms that are True, because you think it is the most convenient way. I disagree here. :)

My proposal can search for any rings using a simple deterministic approach that defines how a ring should be percieved.

But I believe if you think of it as an atom-wise property with an extra dimension (i_ring) it is very general because then you can reduce that result into whatever you want with common matrix operations.
# This would give you the number of rings each atom belongs to
np.sum(rings_property, axis=-1)

# This would give you some measure of "connectivity" (through rings) between atoms
rings_property @ rings_property.T

# This would give you the atoms that belong to ring 2
np.where(rings_property[:, 2])

# Etc...
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like pandas.groupby?

The convenience of the shape of the result depends on what you want to achieve afterwards. It is not obvious to me that with the one that you propose it is simpler to achieve @ialcon's original goal (separating the two ribbons of an NPG).

I just think that what you are doing is a quite drastic change, for instance for geometries with 100,000 atoms, you should pre-define the positions of the rings? How would you do that? and then you would get a list of 100,000 x rings? Ok sparse matrices, but...

From what you propose you have to pre-define the rings, before you can make any meaningful sense of the return values, no? What does ring_0 mean? Is it at (x, y) = (10, 10) or at (100, 50) ? My approach would be a discovery of goals, whereas yours would be a predefined goal, and searching for atoms that can meet that goal. I think that is quite different.

I am not saying your method can be useful for few, and well defined properties, such as Z, x etc. But for more complicated things, it seems to scale poorly, and rather difficult to disentangle?

I guess categories could do the same, no ?

Yes, but the point is that if you allow the user to build their own conditions you don't have to implement all the possible conditions as categories.

That was never the intent, the intent was still to do as we have it now, something that defines some basic categories, then combinations of these categories via boolean operations should behave as though they were a single category.

But what is the difference between this and a Category, aren't they the same? Do you mean that categories are something that extracts properties, or? I don't see a big difference between the two.

The main point that I want to make is that a category (as it is now) is the combination of a property + a condition. Internally inside categories we are implementing complex properties like number of neighbors, but we are not exposing that for users to use however they want. Also, we are choosing which conditions to implement, which limits the functionality.

Not really, if you want users to extend the properties. Consider a user who wants to search for a defect in a system. Whether it is your or my approach, they would have to subclass a property/category anyways, no?

Another thing is that what is now called Category is only a boolean category. But categorical variables are also typically integers or strings. An atom property could be of any type. And of course depending on what type it is, that property can be used for some things or for others.

I agree that the way the categories are handled, it is a little restrictive when everything returned is a bool, that was just my initial idea, i don't know if this can be generalized out...

Regarding the points of @ialcon, atom-wise properties could also have metadata, so I would say this is not an argument in favor of categories vs properties.

So my question, is how do you define the ring property so it is generic for the end user? I can't see that working seemlessly for the bool array you are proposing.

I think that this discussion is, somehow, touching on a deeper discussion about what should be the limits of sisl regarding structure manipulation. Somehow I agree with @pfebrer that trying to implement more and more complex Categories might limit the usage of end users, partly because choosing which complex Categories are implemented is arbitrary, and might not fit what a particular user is looking for. On the contrary, if Categories are kept relatively simple, that could be a good general basis for most of users to exploit those Categories to do crazy things specific to their research problem. In this regard, I feel keeping Categories as atom-wise properties is a good approach. As @pfebrer highlighted, one still may accommodate a high degree of complexity in that "atom-wise" format, and that should well cover the level of structural complexity that sisl (at least to start with) should cover. Of course if, with time, it became clearer that a higher level of complexity was required, we could discuss again if the atom-wise type of Categories may accommodate such new level of complexity or if, contrary, a new type of non-atom-wise Categories (or Classifiers or Fragments or whatever) should be specifically build for that. That would not affect the more basic (and of general use) atom-wise Categories and, at that future point, it would also be useful to know if users are actually using the atom-wise Categories (via some sort of queries for users or something: e.g. "do you use Categories in your daily work?`) to have a better guess about whether more complex functionality could be useful.

The basic Category things are already there. The main problem is how the categories should be changed so we can allow groups of atoms to belong to a single category/property (whatever it may be).

I think that most users don't really think that the bool array that @pfebrer would be straighforward.

# properties (pfebrer, you might have thought more carefully about this, how would this look?
prop = geom.properties([Z == 6, x > 5, x < 10])
idx = np.sum(prop, axis=1).nonzero()[0]

# categories
cat = AtomZ(6) & AtomXYZ.x < 10  & AtomXYZ.x > 5
idx = cat.categorize(geom)

I think the latter is more intuitive. While the prop array can contain more details on sorting after might be useful. It can also turn out to be a bit more difficult to understand.

So, summing up, I kind of agree now that the approach of @pfebrer might be the best to go as of now. We know that these atom-wise Categories will be used and, additionally, I guess they are simplier to implement, which should keep the development effort as minimal as possible - focusing on those Categories that are more obviously necessary, and building from there. Regarding Rings, I think that an atom-wise Category could be useful - after all, a Rings categorization is something sufficiently fundamental that could have general usage, I think..

The real problem really is that we want to do Rings ;) So we can't do an atomic-only category/property, I still have a hard time seeing @pfebrer suggestion.

I am not saying my CategoryResult is perfect, it might still require some finetuning. I am just really curious as to how we can efficiently decipher groups of atoms, and where that group belongs (through meta-data).

pfebrer commented 6 months ago

My proposal can search for any rings using a simple deterministic approach that defines how a ring should be percieved.

But I propose to do exactly the same as you say, just shaping the result differently to fit in a more general framework (atom-wise properties).

I just think that what you are doing is a quite drastic change, for instance for geometries with 100,000 atoms, you should pre-define the positions of the rings? How would you do that?

No, it would work the same as your proposal. You are right that getting a certain ring index on advance doesn't make sense. You can only do it when you have the results.

and then you would get a list of 100,000 x rings?

No, it would be a sparse matrix, so basically it uses the same memory as the return that you propose.

Not really, if you want users to extend the properties. Consider a user who wants to search for a defect in a system. Whether it is your or my approach, they would have to subclass a property/category anyways, no?

If the user has to implement a new property, yes. If it's just a combination of existing properties with some comparison operators, then no.

So my question, is how do you define the ring property so it is generic for the end user? I can't see that working seemlessly for the bool array you are proposing.

I don't know, the same way that you define the category.

I think that most users don't really think that the bool array that @pfebrer would be straighforward.
# properties (pfebrer, you might have thought more carefully about this, how would this look?
prop = geom.properties([Z == 6, x > 5, x < 10])
idx = np.sum(prop, axis=1).nonzero()[0]

# categories
cat = AtomZ(6) & AtomXYZ.x < 10  & AtomXYZ.x > 5
idx = cat.categorize(geom)
I think the latter is more intuitive. While the prop array can contain more details on sorting after might be useful. It can also turn out to be a bit more difficult to understand.

No, my idea is that properties can be used in a very similar way as categories. In this example:

from sisl.atom_properties import Z, x

# Create a mask by creating a boolean atom property
mask = (Z == 6) & (5 < x < 10)
# Use it (e.g. in methods that have an atoms argument)
geom.sub(mask)

But you can also use non-boolean properties (e.g. Z or x) directly. For example, to indicate color in plots.

geom.plot(atoms_style={"color": Z, "size": x})

There can also be properties that are themselves boolean, without the need for a comparison. This would be similar to the current categories. E.g. an is_sp3 property would return True for an atom with sp3 hybridization and False otherwise:

from sisl.atom_properties import is_sp3

# We can use this property directly as a mask, because it contains boolean values
geom.sub(is_sp3)

There could also be properties that compute a string for each atom. E.g. a hybridization property that maps atoms to their hybridization.

from sisl.atom_properties import hybridization

# A compute (or get) method could be exposed to trigger the computation of the property 
hybridization.compute(geom) # ["sp", "sp2", "sp", ...]

zerothi commented 6 months ago

Ok, I kind of get where you are at. :) I am more seeing what you want now.

But, I guess all this could be resolved by exposing the functional equivalents? However, it poses some problems. When you do "color": Z the viz code still needs to know that it is a property that it should fetch from the geometry, so it requires some logic everywhere where the properties can be non-properties. Wouldn't it be better to just explicitly pass "color": geom.atoms.Z?

We could expose properties on the geometries x, y, ... etc. if that makes some things simpler. A PR for this would be accepted I think! (probably only coordinates first).

But from what I read they don't need to be exclusive, they could live together, properties and categories. It actually seems like a Z > 3 would return a category ;)

No, my idea is that properties can be used in a very similar way as categories. In this example:
from sisl.atom_properties import Z, x
# Create a mask by creating a boolean atom property
mask = (Z == 6) & (5 < x < 10)
# Use it (e.g. in methods that have an atoms argument)
geom.sub(mask)
This would be equivalent in geom.sub(category), so I don't see the difference there?

I can sort of see that properties can be powerful in the sense that they compute stuff based on the input geometry. Ok, I am somewhat convinced that the properties can be useful. I can see they can fill the same thing as categories. Possibly with a bit more usefulness.

I think, since you are thinking of properties that can return different things, probably we should think about a streamlined return object? Instead of bool/str etc. Would a PropResult object be useful here? How should it be organized? I kind of like the way groupby and re.Match are organized, it should also be more familiar to end-users? Could we leverage names and technologies from there?

Also, I don't think we should do:

 from sisl.atom_properties import Z, x

it can cause name-clashes, what about geometry.props.Z/x , or GeometryProps.Z/x or something similar? Or how can we reduce the chance of name-clashes. I frequently use x = geom.xyz.T[0].

pfebrer commented 6 months ago

But, I guess all this could be resolved by exposing the functional equivalents? However, it poses some problems. When you do "color": Z the viz code still needs to know that it is a property that it should fetch from the geometry, so it requires some logic everywhere where the properties can be non-properties. Wouldn't it be better to just explicitly pass "color": geom.atoms.Z?

What I like about the properties is the same thing that I like about the categories. You can build them when you don't have a geometry, and then use them for several geometries. So in the example of the geometry plot, if you have passed "color": geom.atoms.Z and you update the geometry, the plot will break. But if you have passed "color": Z it will work still when you update the plot's geometry.

But from what I read they don't need to be exclusive, they could live together, properties and categories. It actually seems like a Z > 3 would return a category ;)

Yes, that was my first thought, but then I didn't see the point in having a special class for boolean properties. I guess a dtype attribute on the property would be enough to check if it can be used as a mask (if you want to check it). Although it is true that boolean properties could have some methods that are specific to the type, e.g. categorize :smile:

This would be equivalent in geom.sub(category), so I don't see the difference there?

There is no difference, that's the point. But now the computation of the properties is exposed. Well there is a difference: with the current categories you can only do AtomZ(6) which is Z == 6, but not Z < 6 or Z > 6. I know that you can implement them, but the point is that with this approach you don't need to because users can build whatever they want.

I think, since you are thinking of properties that can return different things, probably we should think about a streamlined return object? Instead of bool/str etc. Would a PropResult object be useful here? How should it be organized? I kind of like the way groupby and re.Match are organized, it should also be more familiar to end-users? Could we leverage names and technologies from there?

Could it be a subclass of np.ndarray so that the result behaves like an array, but just with some extra convenient methods/attributes?

Also, I don't think we should do:
 from sisl.atom_properties import Z, x
it can cause name-clashes, what about geometry.props.Z/x , or GeometryProps.Z/x or something similar? Or how can we reduce the chance of name-clashes. I frequently use x = geom.xyz.T[0].

I guess it would be up to the user to do something like

from sisl import atom_properties
atom_properties.Z

or like

from sisl.atom_properties import Z

but we could also just use the terminology that we use currently for categories. I.e. Z -> AtomZ.

zerothi commented 6 months ago

Yeah, we might do something like that. But... ;) Do we foresee that we will never use some methods that aren't already defined in numpy? If we lock ourselves to the ndarray subclass, then we have to get the overwritten methods correct first time, it will be hard to overwrite methods later. I don't see an immediate problem, do you?

pfebrer commented 6 months ago

Which methods would you overwrite? What I mean is that this could be the result returned from the property, not the AtomProperty class itself. Maybe just something like an extra attribute info (or something like that) where the real custom object is stored could suffice? That object would contain the possible extra data coming from the property calculation.

pfebrer commented 6 months ago

Alternatively we could return a simple xarray.DataArray, which already supports storing attrs.

zerothi commented 6 months ago

What benefit is there using xarray? We should strive for the simplest solution IMHO. That will probably be a ndarray, I.e. Does xyz[Dataarray] work out of the box?

pfebrer commented 6 months ago

That we can store arbitrary attributes without subclassing and there is a built-in way of storing information about coordinates. Also it is easy to merge dataarrays of different dtype into a dataset.

Does xyz[Dataarray] work out of the box?

Yes, these kind of things work with dataarrays.

But yeah I'm also not so sure if it's a good idea keeping in mind that most people is not familiar with them.

zerothi commented 6 months ago

Yeah, I think the familiarity with numpy is good! So let's do that!

ialcon commented 6 months ago

Well, it seems that you converged... Sorry for not participating more pro-actively, but this got quite technical for me to add anything useful honestly... :S

Just for me to understand the final approach, could you quickly remind me what is the difference between the current implemented Categories and the Properties that have been proposed by @pfebrer? Also, for the Rings idea, what sort of meta-data would you associate to the np.array object? Or would the idea be to keep it as simple as possible, and so to simply provide an array with 1's and 0's associating each atom to a given Ring? (associated to each column in the array, as @pfebrer previously exposed)

pfebrer commented 6 months ago

I don't know how would the rings thing turn out in the end, but the differences between the current categories and (possible future) atom properties are:

Up until now, categories were meant only as a way to select/filter atoms. Atom properties can be of any data type (float, int, str , bool...), so they can be used for whatever case an atom property is required. Some data types might be associated to specific functions. E.g. bool atom properties are probably useful for selecting/filtering. But the user is not restricted (or suggested) to use it for that specific functionality, as the naming doesn't imply that the property is associated to any functionality.
Boolean masks could be created by applying comparisons to atom properties: Z > 6, x >= (y * 2)...
With properties, the return would be an array with the computed property, although it might contain metadata on how the property was computed or some extra information. The result of categorizing was an array containing either NullCategory if it didn't match the category or the category itself if it matched. I think arrays of built-in types should be easier to postprocess by users.

zerothi commented 6 months ago

No worries Isaac! Your comments are extremely valuable, you'll be the target user ;)! :)

Adding to @pfebrer, it seems to me that the boolean operations done on a property would implicitly return a Category. We might even have the category work equivalently, AtomZ > 6, e.g.

pfebrer commented 6 months ago

What methods would an atom category have that an atom property doesn't?

Also, your plans are still to put the rings functionality into a category, right? Could you clarify what is the reasoning behind it? Would it be an AtomCategory? I still can't see the logic behind framing this as a category problem instead of just a separate function. Unless you define it as the presence of an atom in each ring being a category that you create on the fly, as I said. But I understand that the output of that may be unnecesarily complex.

For example, something that I would understand it would make sense that it was a category/property is the number of rings each atom belongs to.

ialcon commented 6 months ago

Yes, I agree that it does not seem like the more natural thing to do... It may be that for Rings the best is to simply have a separate function that provides a list of Ring objects, which could then contain anything related to them like, for e.g., the indices of the atoms that below to each Ring instance.