Open zerothi opened 9 months ago
Here it would also be nice if one could count the number of categories a single atom is existing in. E.g. when extracting the rings (from #688), it might be useful to extract non-overlapping rings, in which case the initial categorization would return overlapping rings. Then one would need to figure out a single atom which is only in 1 ring. And then remove duplicates.
Here are some more thoughts on how to generalize the categories. Currently the categories are looped on individual atoms, and then checked.
However,
When we want to deal with categories involving multiple atoms, we have to know the full geometry before making a categorization.
When doing composite categories with categories involving multiple atoms (say rings), then it might be valuable to also restrict the ring
to be a combination of another set of categories, i.e. 5 C
and 1 N
atom. This could possibly be done by having a category which fixes a preset number of atoms with another category, since there are no logical symbols involving *
, we could overload the Category.__mul__
to accept a category and an integer? This could be errorprone as order of operations would matter
# this would result in the (Ring & Z) * 5, which would never be true.
cat = CategoryRing(...) & CategoryZ(Z=6) * 5
Exactly how this would work I am not sure, it should be simple, and intuitive, ideas are welcome!
Categories should act on the complete geometry. So the interface would stay the same, yet the way it works would be changed.
My idea would be that a categorization would always return a CategoryResult
object, which stores the category it originates from, and a list of positive results. E.g. for a AtomZ
it would return something like [[1], [2]]
. The point of this would be that categories involving mulitple atoms could have a result index like [[1, 2, 3, 4], [7, 8, 9]]
. The result
object could then also hold additional information that is defined by the originating Category
, this is partly inspired by the scipy
, OptimizeResult
which can be very versatile, but a bit hard to document. On the other hand, it might be simple enough for end-users since the additional information isn't always required knowledge (you just want the result arrays).
This CategoryResult
object would then have an __iter__
implementation that iterates the result index (as shown above).
This should also do the equivalent for CompositeCategories
, the result should encompass a result from the combined result of both categories. This might seem a bit dubious, however, I think it could be useful for the composites to contain the information on the nested categories. Consider for example the case where one wishes to extract the atoms only having 2 neigbors, and they must only be carbons, here it would be ideal if one could access a result from the neighbor result (it might contain the information on which atoms are its neighbors etc.) So we need a mechanism that lets one filter the result easily, since we can't push down the stored information from nested categories into a composite result (consider if there are two branches of categories, both with neighbor conditions, that would mean that there would be two (possibly different) additional information.
Hence it might be ideal to do:
neighbor = Neighbors(...)
cat = neighbor & AtomZ(6)
results = cat.categorize(geometry)
for result in results.select(neighbor):
...
this would only loop the positives of the composite category cat
, but filter out the result array from the neighbor
category (if the extra information is needed). If one is simply interested in the indices, one would simply do for result in results
.
When dealing with composites we should, by default, force the categories to be fully expanded. I.e. each category in the Compostite
should be parsed as though it didn't know of the other one. Hence one could do (same as above):
for result in results.B: # if we should name it `A`/`B` (as is done now), or `lhs`/`rhs`.
I am not sure if results.B
would be too ambiguous with results.select(AtomZ(6))
? Should it even be possible? I assume it could be useful to check for near misses with this functionality open, and it shouldn't contradict the methodology as they are separated.
Oh, and all of the above would deprecate the NullCategory
which is actually kind of weird to deal with... So that would make things simpler.
Hmm, I might come up with some more, but here some discussion points, @pfebrer @ialcon comments?
Hi @zerothi, I like very much the overall approach, and to me it looks quite logical and organized. I also like both the for result in results.select(neighbor):
and for result in results.B:
construction. I understand that "A" in your example would be the first category indicated in the composite creation, and "B" the second, right?
results.A --> neighbor
results.B --> AtomZ(6)
In this example what it is not fully clear to me is to what AtomZ(6)
applies? To the targeted ia
whose bonds are scrutinized (does ia
have 2 bonds?) or to its neighbors? So, in other words, cat
in cat = neighbor & AtomZ(6)
would provide all carbon atoms having (say) 2 neighbors or it would provide all atoms having 2 carbon neighbors? I guess it is the first... right? There would be an easy way to do the second using a CompositeCategory
?
Finally, returning to the iteration of results, which type of info would provide something like for result in results.B
? In the case of neighbors, it would return the number of neighbors for each result, the indices of the neighbors for each result? Could this Category
-specific results have attributes? Say:
for result in results.B :
print(result.n_neighbors) # number of neighbors
print(result.neighbors) # indices of neighbors
...
And, super finally, wouldn't make more sense than this Category
-specific information was obtained per result instead from the list of results ... :
for result in results :
print(result.B.n_neighbors) # number of neighbors
print(result.B.neighbors) # indices of neighbors
print(result.B.distances) # distance to each neighbor
...
Of course I guess this also depends on how simple/complex you want each Category
to be..
2/3. I don't understand the results you are proposing. Why do we need a 2D array of results? To me the result of categorizing should be an atom-wise property. I.e. an array of length na
containing the category of each atom. I think anything else is overcomplicating the implementation and the usage. Perhaps we are trying to fit into the concept something that simply can't fit (ring identification).
Again I think this might be overcomplicating something as simple as categorization. Here I see the point that you are making, but do you have an example where it could be useful?
Agree, it is better that it disappears :)
I think a clean approach would be to first define atom-wise properties, maybe under an AtomProperty
class: some properties could be:
Z
NNeighbors
x
, y
, z
These properties, when combined with a geometry, should return an array containing the property value for each atom.
Then, from them, using comparison operators, one could create a BooleanCategory
(a subclass of AtomProperty
maybe?), that could be used as a mask to select atoms for example:
NNeighbors == 2
: Boolean mask specifying if the atom has two neighbors or a different number.NNeighbors > 2
: Boolean mask specifying if the atom has more than two neighbors.These categories would be combined into composite categories as the current categories are now.
Then one could also use these properties to create IntCategory
or StrCategory
. Although here I don't see the need for special classes, they could be simple AtomProperty
whose property is an integer or a string. But the point I want to make is that these would be used for classifying, rather than creating masks:
NNeighbors
is already a "classifier" since it returns integers.Clustered([Z, NNeighbors], some_clustering_algorithm)
: Here Clustered
would take some other property (or multiple of them) and acording to some algorithm group atoms into clusters. Each atom would have an integer assigned which is the cluster index.The idea is that these atom properties I think could be much more versatile and clean, as they would be defined as atom-wise arrays of data. So their shape is more predictable. And they can be reused for other things. E.g. to color or scale atoms in plots :smile:
I would advocate for their returns being simple arrays, but maybe CategoryResult
could be a wrapper that makes it look and work like an array while containing extra metadata.
I think the ring identification could be better off in a separate functionality unless we can frame it as an atom-wise problem. Otherwise it makes the concept of categories much more complicated.
Hi @zerothi, I like very much the overall approach, and to me it looks quite logical and organized. I also like both the
for result in results.select(neighbor):
andfor result in results.B:
construction. I understand that "A" in your example would be the first category indicated in the composite creation, and "B" the second, right?results.A --> neighbor results.B --> AtomZ(6)
Indeed, that was the idea.
In this example what it is not fully clear to me is to what
AtomZ(6)
applies? To the targetedia
whose bonds are scrutinized (doesia
have 2 bonds?) or to its neighbors? So, in other words,cat
incat = neighbor & AtomZ(6)
would provide all carbon atoms having (say) 2 neighbors or it would provide all atoms having 2 carbon neighbors? I guess it is the first... right? There would be an easy way to do the second using aCompositeCategory
? The first, yes,AtomZ
selects atoms with a givenZ
. Then the neighbors are determined on the smaller section.Finally, returning to the iteration of results, which type of info would provide something like
for result in results.B
? In the case of neighbors, it would return the number of neighbors for each result, the indices of the neighbors for each result? Could thisCategory
-specific results have attributes? Say:for result in results.B : print(result.n_neighbors) # number of neighbors print(result.neighbors) # indices of neighbors ...
Yeah, here I am a bit more undecided. Since you are already in the
AtomZ
category, you can't access the neighbors. That is a bit difficult on the composite categories.And, super finally, wouldn't make more sense than this
Category
-specific information was obtained per result instead from the list of results ... :for result in results : print(result.B.n_neighbors) # number of neighbors print(result.B.neighbors) # indices of neighbors print(result.B.distances) # distance to each neighbor ...
Of course I guess this also depends on how simple/complex you want each
Category
to be.. Agreed! This is a good suggestion, lets continue along that line.
- Agree
2/3. I don't understand the results you are proposing. Why do we need a 2D array of results? To me the result of categorizing should be an atom-wise property. I.e. an array of length
na
containing the category of each atom. I think anything else is overcomplicating the implementation and the usage. Perhaps we are trying to fit into the concept something that simply can't fit (ring identification).
The main problem of the way things currently work is exactly that it is a category per atom.
If the category is per atom, then it requires additional checks to describe which atoms belongs to a ring
or not? No?
Perhaps the 2D lists would only be interesting for those categories which deals with multiple atoms (such as the ring
). But for AtomZ
it could just as well be a 1D array.
4. Again I think this might be overcomplicating something as simple as categorization. Here I see the point that you are making, but do you have an example where it could be useful?
Another motivation of mine, is the re, it works quite similarly. You query some quantities, and loop the results. A main use would be the rings, where looping the results is equivalent to looping the rings.
EDIT, sorry didn't do an actual example.
My motivation for this would be that one wanted to search for neighbors, but only with AtomZ
, one could then check whether it was the lhs
or the rhs
which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!
5. Agree, it is better that it disappears :)
I think a clean approach would be to first define atom-wise properties, maybe under an
AtomProperty
class: some properties could be:* `Z` * `NNeighbors` * `x`, `y`, `z` * ... basically all properties that are checked in the current categories and maybe some more.
The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.
These properties, when combined with a geometry, should return an array containing the property value for each atom.
Then, from them, using comparison operators, one could create a
BooleanCategory
(a subclass ofAtomProperty
maybe?), that could be used as a mask to select atoms for example:* `NNeighbors == 2`: Boolean mask specifying if the atom has two neighbors or a different number. * `NNeighbors > 2`: Boolean mask specifying if the atom has more than two neighbors.
These categories would be combined into composite categories as the current categories are now.
Then one could also use these properties to create
IntCategory
orStrCategory
. Although here I don't see the need for special classes, they could be simpleAtomProperty
whose property is an integer or a string. But the point I want to make is that these would be used for classifying, rather than creating masks:* `NNeighbors` is already a "classifier" since it returns integers. * `Clustered([Z, NNeighbors], some_clustering_algorithm)`: Here `Clustered` would take some other property (or multiple of them) and acording to some algorithm group atoms into clusters. Each atom would have an integer assigned which is the cluster index.
I think that this would just be difficult to extend, whereas my approach seems a bit more generic, and would allow one to search also based on other criteria. For instance, a ring with 5 Carbon and 1 Nitrogen.
The idea is that these atom properties I think could be much more versatile and clean, as they would be defined as atom-wise arrays of data. So their shape is more predictable. And they can be reused for other things. E.g. to color or scale atoms in plots 😄
I would advocate for their returns being simple arrays, but maybe
CategoryResult
could be a wrapper that makes it look and work like an array while containing extra metadata. If wanting 1D arrays, it should be trivially handled, no?indices = np.concatenate([result.index for result in results])
I think the ring identification could be better off in a separate functionality unless we can frame it as an atom-wise problem. Otherwise it makes the concept of categories much more complicated.
That was why I suggested to make it based on the full geometry, and not necessarily an Atom
property.
My main worry is that limiting its functionality basically makes it useless. Using a AtomZ
category is just geometry.atoms.Z == 6
, but nesting the logical operators becomes a head-ache, and here the categories would be useful.
@pfebrer see my edit!
The main problem of the way things currently work is exactly that it is a category per atom.
Isn't the problem that the category is computed atom by atom? I mean if you give the whole geometry to the property calculator and then it calculates an atom-wise property, that is not a limiting factor, no?
My motivation for this would be that one wanted to search for neighbors, but only with AtomZ, one could then check whether it was the lhs or the rhs which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!
I understand this, but maybe a cleaner solution would be to cache the results of categorization / property calculation, so that a second query is basically for free?
The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.
No, why? You can get at least to the same level of complexity. E.g. imagine that the NNeighbors
property accepts a neighbor
argument to filter. It would be exactly like the current neighbors category:
NNeighbor(R=3) # Gives you the number of neighbors up to a certain R.
NNeighbor(R=3, neighbor=Z > 1) # Gives you the number of neighbors without counting hydrogens.
The only difference is that you specify the properties, not the categories, and then you create "categories" by using the comparison operators. I'd say that if something you have more freedom to get as complex as you want easily. You can for example create "categories" by comparing two properties:
x > y # This would be a mask that selects atoms whose x coordinate is bigger than the y coordinate.
And actually I have thought of something that could work quite good with the ring finder. The returned property could be of shape (na, n_rings)
. Something like:
[
[0, 0, 0], # Atom 0 belongs to [ring_0, ring_1, ring_2] ?
[0, 1, 1], # Atom 1 belongs to [ring_0, ring_1, ring_2] ?
[1, 0, 0], # Atom 2 belongs to [ring_0, ring_1, ring_2] ?
...
]
Which could be a sparse matrix, of course. But the point is that this would still be an atom-wise property. And doing things like:
rings_property = RingFinder()
rings_count = np.sum(rings_property, axis=-1)
Would give you a property that computes the number of rings to which each atom belongs. Again you would then be free to create a category however complex you want using this property.
You could do many other cool things. E.g.:
# Property that returns a (na, nZ) array specifying how many neighbors of each type an atom has.
neighs = NNeighbors(dim1=Z)
# Category that selects atoms that have 1 hydrogen neighbor and 2 carbon neighbors.
my_complex_category = neighs[:, 1] == 1 & neighs[:, 6] == 2
Wouldn't it be nice? :smiling_face_with_tear:
By the way all this lazy arithmetic (and the caching) could be supported by the nodes
framework.
I kind of also understand the point of @pfebrer and, somehow, agree with it. At the same time, I also understand that, as you say @zerothi, limiting to atom-wise categorization limits a lot what one could do (specially in terms of complexity and versatility).
Would it be possible to have Categories
for atom-wise properties (that is, as @pfebrer advocates) and something like Classifiers
(or a similar name, maybe Fragments
...) that classify atoms based on a particular property? In such a way Categories
could be maintained clean and simple, whereas Classifiers
could go crazier, without affecting the fundamental structure of Categories
. Of course, Classifiers
could internally use Categories
to achieve their classification, but Categories
in turn would not care at all about Classifiers
and not being affected by them at all.
So, following the initial example by @zerothi, instead of doing:
cat = CategoryRing(...) & CategoryZ(Z=6) * 5
You could have:
get_rings = Ring([(AtomZ(6), 5), (AtomZ(7), 1)])
rings = Ring.classify(geometry)
for ring in rings :
print(ring.na)
print(ring.id_ring)
print(ring.idx)
print(ring.dihedral)
....
I am not sure if this makes sense to you, but to me is sort of trying to have some sort of hierarchy, so that we can keep separate the complexity of more versatile objects from the simplier atom-wise categories.
The main problem of the way things currently work is exactly that it is a category per atom.
Isn't the problem that the category is computed atom by atom? I mean if you give the whole geometry to the property calculator and then it calculates an atom-wise property, that is not a limiting factor, no?
Yes, the problem is that it is an atom-wise
property. That is exactly what I am trying to propose to go about. A Category
should be able to encompass any number of atoms etc. It might be overkill for AtomZ
to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)
My motivation for this would be that one wanted to search for neighbors, but only with AtomZ, one could then check whether it was the lhs or the rhs which didn't match it by using this. It might not be necessary, so we could do without, and force the user to do a 2nd request. I guess this point could be left out!
I understand this, but maybe a cleaner solution would be to cache the results of categorization / property calculation, so that a second query is basically for free?
Yeah, this might be overdoing it, perhaps we shouldn't worry here. ;)
The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.
No, why? You can get at least to the same level of complexity. E.g. imagine that the
NNeighbors
property accepts aneighbor
argument to filter. It would be exactly like the current neighbors category:NNeighbor(R=3) # Gives you the number of neighbors up to a certain R. NNeighbor(R=3, neighbor=Z > 1) # Gives you the number of neighbors without counting hydrogens.
But what is the difference between this and a Category
, aren't they the same?
Do you mean that categories are something that extracts properties, or?
I don't see a big difference between the two.
The only difference is that you specify the properties, not the categories, and then you create "categories" by using the comparison operators. I'd say that if something you have more freedom to get as complex as you want easily. You can for example create "categories" by comparing two properties:
x > y # This would be a mask that selects atoms whose x coordinate is bigger than the y coordinate.
I guess categories could do the same, no ?
And actually I have thought of something that could work quite good with the ring finder. The returned property could be of shape
(na, n_rings)
. Something like:[ [0, 0, 0], # Atom 0 belongs to [ring_0, ring_1, ring_2] ? [0, 1, 1], # Atom 1 belongs to [ring_0, ring_1, ring_2] ? [1, 0, 0], # Atom 2 belongs to [ring_0, ring_1, ring_2] ? ... ]
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like
pandas.groupby
?Which could be a sparse matrix, of course. But the point is that this would still be an atom-wise property. And doing things like:
rings_property = RingFinder() rings_count = np.sum(rings_property, axis=-1)
Would give you a property that computes the number of rings to which each atom belongs. Again you would then be free to create a category however complex you want using this property.
You could do many other cool things. E.g.:
# Property that returns a (na, nZ) array specifying how many neighbors of each type an atom has. neighs = NNeighbors(dim1=Z) # Category that selects atoms that have 1 hydrogen neighbor and 2 carbon neighbors. my_complex_category = neighs[:, 1] == 1 & neighs[:, 6] == 2
Wouldn't it be nice? 🥲 I see, I agree that it could be nice if the methods are generic in some way. Hmm, this needs some thinking. I kind of don't like the way one should post-process things, it seems to be cumbersome when wanting to do more complicated things? By specifying constraints to what you request, you have less flexibility at query time, but simpler parsing.
I kind of also understand the point of @pfebrer and, somehow, agree with it. At the same time, I also understand that, as you say @zerothi, limiting to atom-wise categorization limits a lot what one could do (specially in terms of complexity and versatility).
Would it be possible to have
Categories
for atom-wise properties (that is, as @pfebrer advocates) and something likeClassifiers
(or a similar name, maybeFragments
...) that classify atoms based on a particular property? In such a wayCategories
could be maintained clean and simple, whereasClassifiers
could go crazier, without affecting the fundamental structure ofCategories
. Of course,Classifiers
could internally useCategories
to achieve their classification, butCategories
in turn would not care at all aboutClassifiers
and not being affected by them at all. If we can get all functionality in 1 class, then I would really try and maintain it like that, I don't know if that will work, if not we might have to have something like yourFragments
idea.So, following the initial example by @zerothi, instead of doing:
cat = CategoryRing(...) & CategoryZ(Z=6) * 5
You could have:
get_rings = Ring([(AtomZ(6), 5), (AtomZ(7), 1)]) rings = Ring.classify(geometry) for ring in rings : print(ring.na) print(ring.id_ring) print(ring.idx) print(ring.dihedral) ....
I am not sure if this makes sense to you, but to me is sort of trying to have some sort of hierarchy, so that we can keep separate the complexity of more versatile objects from the simplier atom-wise categories.
I somewhat agree, i just always find API's that have some kind of flexibility very nice to work with, it takes a bit of time to work it out. But...
That is why this discussion is important, what can be done, and how would it work ;)
Yes, the problem is that it is an atom-wise property. That is exactly what I am trying to propose to go about. A Category should be able to encompass any number of atoms etc. It might be overkill for AtomZ to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)
The way I see it, in the rings example you are creating multiple categories on the fly, one for each ring that you find: belongs_to_ring_0
: (true, false), belongs_to_ring_1
: (true, false), ... And then you are defining the value of each atom on that category. So effectively you are building a 2D boolean array like the one I propose. Then you choose to represent this sparse matrix as a list of lists storing the indices of the atoms that are True
, because you think it is the most convenient way. But I believe if you think of it as an atom-wise property with an extra dimension (i_ring
) it is very general because then you can reduce that result into whatever you want with common matrix operations.
# This would give you the number of rings each atom belongs to
np.sum(rings_property, axis=-1)
# This would give you some measure of "connectivity" (through rings) between atoms
rings_property @ rings_property.T
# This would give you the atoms that belong to ring 2
np.where(rings_property[:, 2])
# Etc...
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like pandas.groupby?
The convenience of the shape of the result depends on what you want to achieve afterwards. It is not obvious to me that with the one that you propose it is simpler to achieve @ialcon's original goal (separating the two ribbons of an NPG).
I guess categories could do the same, no ?
Yes, but the point is that if you allow the user to build their own conditions you don't have to implement all the possible conditions as categories.
But what is the difference between this and a Category, aren't they the same? Do you mean that categories are something that extracts properties, or? I don't see a big difference between the two.
The main point that I want to make is that a category (as it is now) is the combination of a property + a condition. Internally inside categories we are implementing complex properties like number of neighbors, but we are not exposing that for users to use however they want. Also, we are choosing which conditions to implement, which limits the functionality.
Another thing is that what is now called Category
is only a boolean category. But categorical variables are also typically integers or strings. An atom property could be of any type. And of course depending on what type it is, that property can be used for some things or for others.
Regarding the points of @ialcon, atom-wise properties could also have metadata, so I would say this is not an argument in favor of categories vs properties.
I think that this discussion is, somehow, touching on a deeper discussion about what should be the limits of sisl
regarding structure manipulation. Somehow I agree with @pfebrer that trying to implement more and more complex Categories
might limit the usage of end users, partly because choosing which complex Categories
are implemented is arbitrary, and might not fit what a particular user is looking for. On the contrary, if Categories
are kept relatively simple, that could be a good general basis for most of users to exploit those Categories
to do crazy things specific to their research problem. In this regard, I feel keeping Categories
as atom-wise properties is a good approach. As @pfebrer highlighted, one still may accommodate a high degree of complexity in that "atom-wise" format, and that should well cover the level of structural complexity that sisl
(at least to start with) should cover. Of course if, with time, it became clearer that a higher level of complexity was required, we could discuss again if the atom-wise type of Categories
may accommodate such new level of complexity or if, contrary, a new type of non-atom-wise Categories
(or Classifiers
or Fragments
or whatever) should be specifically build for that. That would not affect the more basic (and of general use) atom-wise Categories
and, at that future point, it would also be useful to know if users are actually using the atom-wise Categories
(via some sort of queries for users or something: e.g. "do you use Categories
in your daily work?`) to have a better guess about whether more complex functionality could be useful.
So, summing up, I kind of agree now that the approach of @pfebrer might be the best to go as of now. We know that these atom-wise Categories
will be used and, additionally, I guess they are simplier to implement, which should keep the development effort as minimal as possible - focusing on those Categories
that are more obviously necessary, and building from there. Regarding Rings
, I think that an atom-wise Category
could be useful - after all, a Rings
categorization is something sufficiently fundamental that could have general usage, I think..
I'm having a hard time understanding what you mean by categories that are not atom wise. What does "categorizing" mean for you? For me it means assigning a category to each entity. In the case of the geometry the most obvious entity is the atom.
I understand that you might create higher level entities like fragments, and then you might categorize those fragments. But the algorithm that creates those fragments is a grouping operation and in my view it can be understood as categorizing atoms on the category: Does atom "i" belong to fragment "j"?
. You either think of it as this or simply as a grouping operation, but I don't see it as "categorizing multiple atoms". I don't understand what that means :sweat_smile:
Yes, the problem is that it is an atom-wise property. That is exactly what I am trying to propose to go about. A Category should be able to encompass any number of atoms etc. It might be overkill for AtomZ to have a list of positives, an 1D array might work here. However, I think groups of positives can be made quite versatile (as always I could be wrong) ;)
The way I see it, in the rings example you are creating multiple categories on the fly, one for each ring that you find:
belongs_to_ring_0
: (true, false),belongs_to_ring_1
: (true, false), ... And then you are defining the value of each atom on that category. So effectively you are building a 2D boolean array like the one I propose. Then you choose to represent this sparse matrix as a list of lists storing the indices of the atoms that areTrue
, because you think it is the most convenient way. I disagree here. :)
My proposal can search for any rings using a simple deterministic approach that defines how a ring should be percieved.
But I believe if you think of it as an atom-wise property with an extra dimension (
i_ring
) it is very general because then you can reduce that result into whatever you want with common matrix operations.# This would give you the number of rings each atom belongs to np.sum(rings_property, axis=-1) # This would give you some measure of "connectivity" (through rings) between atoms rings_property @ rings_property.T # This would give you the atoms that belong to ring 2 np.where(rings_property[:, 2]) # Etc...
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like pandas.groupby?
The convenience of the shape of the result depends on what you want to achieve afterwards. It is not obvious to me that with the one that you propose it is simpler to achieve @ialcon's original goal (separating the two ribbons of an NPG).
I just think that what you are doing is a quite drastic change, for instance for geometries with 100,000 atoms, you should pre-define the positions of the rings? How would you do that? and then you would get a list of 100,000 x rings? Ok sparse matrices, but...
From what you propose you have to pre-define the rings, before you can make any meaningful sense of the return values, no? What does ring_0
mean? Is it at (x, y) = (10, 10)
or at (100, 50)
?
My approach would be a discovery of goals, whereas yours would be a predefined goal, and searching for atoms that can meet that goal.
I think that is quite different.
I am not saying your method can be useful for few, and well defined properties, such as Z
, x
etc. But for more complicated things, it seems to scale poorly, and rather difficult to disentangle?
I guess categories could do the same, no ?
Yes, but the point is that if you allow the user to build their own conditions you don't have to implement all the possible conditions as categories.
That was never the intent, the intent was still to do as we have it now, something that defines some basic categories, then combinations of these categories via boolean operations should behave as though they were a single category.
But what is the difference between this and a Category, aren't they the same? Do you mean that categories are something that extracts properties, or? I don't see a big difference between the two.
The main point that I want to make is that a category (as it is now) is the combination of a property + a condition. Internally inside categories we are implementing complex properties like number of neighbors, but we are not exposing that for users to use however they want. Also, we are choosing which conditions to implement, which limits the functionality.
Not really, if you want users to extend the properties. Consider a user who wants to search for a defect in a system. Whether it is your or my approach, they would have to subclass a property/category anyways, no?
Another thing is that what is now called
Category
is only a boolean category. But categorical variables are also typically integers or strings. An atom property could be of any type. And of course depending on what type it is, that property can be used for some things or for others.
I agree that the way the categories are handled, it is a little restrictive when everything returned is a bool
, that was just my initial idea, i don't know if this can be generalized out...
Regarding the points of @ialcon, atom-wise properties could also have metadata, so I would say this is not an argument in favor of categories vs properties.
So my question, is how do you define the ring
property so it is generic for the end user? I can't see that working seemlessly for the bool
array you are proposing.
I think that this discussion is, somehow, touching on a deeper discussion about what should be the limits of
sisl
regarding structure manipulation. Somehow I agree with @pfebrer that trying to implement more and more complexCategories
might limit the usage of end users, partly because choosing which complexCategories
are implemented is arbitrary, and might not fit what a particular user is looking for. On the contrary, ifCategories
are kept relatively simple, that could be a good general basis for most of users to exploit thoseCategories
to do crazy things specific to their research problem. In this regard, I feel keepingCategories
as atom-wise properties is a good approach. As @pfebrer highlighted, one still may accommodate a high degree of complexity in that "atom-wise" format, and that should well cover the level of structural complexity thatsisl
(at least to start with) should cover. Of course if, with time, it became clearer that a higher level of complexity was required, we could discuss again if the atom-wise type ofCategories
may accommodate such new level of complexity or if, contrary, a new type of non-atom-wiseCategories
(orClassifiers
orFragments
or whatever) should be specifically build for that. That would not affect the more basic (and of general use) atom-wiseCategories
and, at that future point, it would also be useful to know if users are actually using the atom-wiseCategories
(via some sort of queries for users or something: e.g. "do you useCategories
in your daily work?`) to have a better guess about whether more complex functionality could be useful.
The basic Category
things are already there. The main problem is how the categories should be changed so we can allow groups of atoms to belong to a single category/property (whatever it may be).
I think that most users don't really think that the bool
array that @pfebrer would be straighforward.
# properties (pfebrer, you might have thought more carefully about this, how would this look?
prop = geom.properties([Z == 6, x > 5, x < 10])
idx = np.sum(prop, axis=1).nonzero()[0]
# categories
cat = AtomZ(6) & AtomXYZ.x < 10 & AtomXYZ.x > 5
idx = cat.categorize(geom)
I think the latter is more intuitive. While the prop
array can contain more details on sorting after might be useful. It can also turn out to be a bit more difficult to understand.
So, summing up, I kind of agree now that the approach of @pfebrer might be the best to go as of now. We know that these atom-wise
Categories
will be used and, additionally, I guess they are simplier to implement, which should keep the development effort as minimal as possible - focusing on thoseCategories
that are more obviously necessary, and building from there. RegardingRings
, I think that an atom-wiseCategory
could be useful - after all, aRings
categorization is something sufficiently fundamental that could have general usage, I think..
The real problem really is that we want to do Rings
;)
So we can't do an atomic-only category/property, I still have a hard time seeing @pfebrer suggestion.
I am not saying my CategoryResult
is perfect, it might still require some finetuning. I am just really curious as to how we can efficiently decipher groups of atoms, and where that group belongs (through meta-data).
My proposal can search for any rings using a simple deterministic approach that defines how a ring should be percieved.
But I propose to do exactly the same as you say, just shaping the result differently to fit in a more general framework (atom-wise properties).
I just think that what you are doing is a quite drastic change, for instance for geometries with 100,000 atoms, you should pre-define the positions of the rings? How would you do that?
No, it would work the same as your proposal. You are right that getting a certain ring index on advance doesn't make sense. You can only do it when you have the results.
and then you would get a list of 100,000 x rings?
No, it would be a sparse matrix, so basically it uses the same memory as the return that you propose.
Not really, if you want users to extend the properties. Consider a user who wants to search for a defect in a system. Whether it is your or my approach, they would have to subclass a property/category anyways, no?
If the user has to implement a new property, yes. If it's just a combination of existing properties with some comparison operators, then no.
So my question, is how do you define the ring property so it is generic for the end user? I can't see that working seemlessly for the bool array you are proposing.
I don't know, the same way that you define the category.
I think that most users don't really think that the bool array that @pfebrer would be straighforward.
# properties (pfebrer, you might have thought more carefully about this, how would this look? prop = geom.properties([Z == 6, x > 5, x < 10]) idx = np.sum(prop, axis=1).nonzero()[0] # categories cat = AtomZ(6) & AtomXYZ.x < 10 & AtomXYZ.x > 5 idx = cat.categorize(geom)
I think the latter is more intuitive. While the prop array can contain more details on sorting after might be useful. It can also turn out to be a bit more difficult to understand.
No, my idea is that properties can be used in a very similar way as categories. In this example:
from sisl.atom_properties import Z, x
# Create a mask by creating a boolean atom property
mask = (Z == 6) & (5 < x < 10)
# Use it (e.g. in methods that have an atoms argument)
geom.sub(mask)
But you can also use non-boolean properties (e.g. Z
or x
) directly. For example, to indicate color in plots.
geom.plot(atoms_style={"color": Z, "size": x})
There can also be properties that are themselves boolean, without the need for a comparison. This would be similar to the current categories. E.g. an is_sp3
property would return True
for an atom with sp3
hybridization and False
otherwise:
from sisl.atom_properties import is_sp3
# We can use this property directly as a mask, because it contains boolean values
geom.sub(is_sp3)
There could also be properties that compute a string for each atom. E.g. a hybridization
property that maps atoms to their hybridization.
from sisl.atom_properties import hybridization
# A compute (or get) method could be exposed to trigger the computation of the property
hybridization.compute(geom) # ["sp", "sp2", "sp", ...]
Ok, I kind of get where you are at. :) I am more seeing what you want now.
But, I guess all this could be resolved by exposing the functional equivalents? However, it poses some problems.
When you do "color": Z
the viz
code still needs to know that it is a property that it should fetch from the geometry
, so it requires some logic everywhere where the properties can be non-properties. Wouldn't it be better to just explicitly pass "color": geom.atoms.Z
?
We could expose properties on the geometries x
, y
, ... etc. if that makes some things simpler. A PR for this would be accepted I think! (probably only coordinates first).
But from what I read they don't need to be exclusive, they could live together, properties and categories. It actually seems like a Z > 3
would return a category ;)
No, my idea is that properties can be used in a very similar way as categories. In this example:
from sisl.atom_properties import Z, x # Create a mask by creating a boolean atom property mask = (Z == 6) & (5 < x < 10) # Use it (e.g. in methods that have an atoms argument) geom.sub(mask)
This would be equivalent in
geom.sub(category)
, so I don't see the difference there?
I can sort of see that properties can be powerful in the sense that they compute stuff based on the input geometry. Ok, I am somewhat convinced that the properties can be useful. I can see they can fill the same thing as categories. Possibly with a bit more usefulness.
I think, since you are thinking of properties that can return different things, probably we should think about a streamlined return object? Instead of bool/str etc.
Would a PropResult
object be useful here? How should it be organized? I kind of like the way groupby
and re.Match
are organized, it should also be more familiar to end-users? Could we leverage names and technologies from there?
Also, I don't think we should do:
from sisl.atom_properties import Z, x
it can cause name-clashes, what about geometry.props.Z/x
, or GeometryProps.Z/x
or something similar? Or how can we reduce the chance of name-clashes. I frequently use x = geom.xyz.T[0]
.
But, I guess all this could be resolved by exposing the functional equivalents? However, it poses some problems. When you do "color": Z the viz code still needs to know that it is a property that it should fetch from the geometry, so it requires some logic everywhere where the properties can be non-properties. Wouldn't it be better to just explicitly pass "color": geom.atoms.Z?
What I like about the properties is the same thing that I like about the categories. You can build them when you don't have a geometry, and then use them for several geometries. So in the example of the geometry plot, if you have passed "color": geom.atoms.Z
and you update the geometry, the plot will break. But if you have passed "color": Z
it will work still when you update the plot's geometry.
But from what I read they don't need to be exclusive, they could live together, properties and categories. It actually seems like a Z > 3 would return a category ;)
Yes, that was my first thought, but then I didn't see the point in having a special class for boolean properties. I guess a dtype
attribute on the property would be enough to check if it can be used as a mask (if you want to check it). Although it is true that boolean properties could have some methods that are specific to the type, e.g. categorize
:smile:
This would be equivalent in geom.sub(category), so I don't see the difference there?
There is no difference, that's the point. But now the computation of the properties is exposed. Well there is a difference: with the current categories you can only do AtomZ(6)
which is Z == 6
, but not Z < 6
or Z > 6
. I know that you can implement them, but the point is that with this approach you don't need to because users can build whatever they want.
I think, since you are thinking of properties that can return different things, probably we should think about a streamlined return object? Instead of bool/str etc. Would a PropResult object be useful here? How should it be organized? I kind of like the way groupby and re.Match are organized, it should also be more familiar to end-users? Could we leverage names and technologies from there?
Could it be a subclass of np.ndarray
so that the result behaves like an array, but just with some extra convenient methods/attributes?
Also, I don't think we should do:
from sisl.atom_properties import Z, x
it can cause name-clashes, what about geometry.props.Z/x , or GeometryProps.Z/x or something similar? Or how can we reduce the chance of name-clashes. I frequently use x = geom.xyz.T[0].
I guess it would be up to the user to do something like
from sisl import atom_properties
atom_properties.Z
or like
from sisl.atom_properties import Z
but we could also just use the terminology that we use currently for categories. I.e. Z
-> AtomZ
.
Yeah, we might do something like that. But... ;)
Do we foresee that we will never use some methods that aren't already defined in numpy
? If we lock ourselves to the ndarray
subclass, then we have to get the overwritten methods correct first time, it will be hard to overwrite methods later.
I don't see an immediate problem, do you?
Which methods would you overwrite? What I mean is that this could be the result returned from the property, not the AtomProperty
class itself. Maybe just something like an extra attribute info
(or something like that) where the real custom object is stored could suffice? That object would contain the possible extra data coming from the property calculation.
Alternatively we could return a simple xarray.DataArray
, which already supports storing attrs
.
What benefit is there using xarray? We should strive for the simplest solution IMHO. That will probably be a ndarray, I.e. Does xyz[Dataarray]
work out of the box?
That we can store arbitrary attributes without subclassing and there is a built-in way of storing information about coordinates. Also it is easy to merge dataarrays of different dtype into a dataset.
Does xyz[Dataarray] work out of the box?
Yes, these kind of things work with dataarrays.
But yeah I'm also not so sure if it's a good idea keeping in mind that most people is not familiar with them.
Yeah, I think the familiarity with numpy is good! So let's do that!
Well, it seems that you converged... Sorry for not participating more pro-actively, but this got quite technical for me to add anything useful honestly... :S
Just for me to understand the final approach, could you quickly remind me what is the difference between the current implemented Categories
and the Properties
that have been proposed by @pfebrer? Also, for the Rings
idea, what sort of meta-data would you associate to the np.array
object? Or would the idea be to keep it as simple as possible, and so to simply provide an array with 1's and 0's associating each atom to a given Ring
? (associated to each column in the array, as @pfebrer previously exposed)
I don't know how would the rings thing turn out in the end, but the differences between the current categories and (possible future) atom properties are:
float
, int
, str
, bool
...), so they can be used for whatever case an atom property is required. Some data types might be associated to specific functions. E.g. bool
atom properties are probably useful for selecting/filtering. But the user is not restricted (or suggested) to use it for that specific functionality, as the naming doesn't imply that the property is associated to any functionality.Z > 6
, x >= (y * 2)
...NullCategory
if it didn't match the category or the category itself if it matched. I think arrays of built-in types should be easier to postprocess by users.No worries Isaac! Your comments are extremely valuable, you'll be the target user ;)! :)
Adding to @pfebrer, it seems to me that the boolean operations done on a property would implicitly return a Category. We might even have the category work equivalently, AtomZ > 6
, e.g.
What methods would an atom category have that an atom property doesn't?
Also, your plans are still to put the rings functionality into a category, right? Could you clarify what is the reasoning behind it? Would it be an AtomCategory
? I still can't see the logic behind framing this as a category problem instead of just a separate function. Unless you define it as the presence of an atom in each ring being a category that you create on the fly, as I said. But I understand that the output of that may be unnecesarily complex.
For example, something that I would understand it would make sense that it was a category/property is the number of rings each atom belongs to.
Yes, I agree that it does not seem like the more natural thing to do... It may be that for Rings
the best is to simply have a separate function that provides a list of Ring
objects, which could then contain anything related to them like, for e.g., the indices of the atoms that below to each Ring
instance.
Describe the feature
Currently the categories are quite strict in the sense that they provide the functionality to check individual atoms.
However, there are cases where it could be valuable to retain state information in the form of
A primary concern when trying to solve #202 is to reduce the overhead of doing the neighbor estimation twice, which is necessary when doing the categories.
Instead we might propose something like this:
when categorizing a
Category
, it should return aCategoryResult
class specifying whether it succeeded (if res
) and some information. I imagine something like:This approach is not too different from
scipy
optimizers. My main worry is the optional variables associated with the categories.