zerothi / sisl

Electronic structure Python package for post analysis and large scale tight-binding DFT/NEGF calculations
https://zerothi.github.io/sisl
Mozilla Public License 2.0
201 stars 60 forks source link

More generic Grid class #502

Open pfebrer opened 2 years ago

pfebrer commented 2 years ago

While working on https://github.com/zerothi/sisl/pull/496, I'm playing with the idea of a sparse 4D grid.

I realized that I need some of the useful methods of Grid related to the division of 3d space. It also makes sense to have the same arguments on initialization.

However, of course, the grid is stored very differently. In the current Grid class it is stored as a dense 3D array in the grid attribute, while in the case of the sparse 4D grid it is stored as a sparse matrix (being the rows the raveled indices of the grid and the columns the extra dimension). We also discussed here https://github.com/zerothi/sisl/issues/341 that to allow multiple dimensions on the grid it might be useful to have some distribution of the data or store the grid partly on-disk.

If starting from scratch, I would suggest that the Grid class only implemented methods regarding the division of 3D space, and then further subclasses implemented a particular storage solution (e.g. DenseGrid, SparseGrid, etc...). However now perhaps it would make more sense to create a _Grid or BaseGrid class and make Grid inherit from it.

I'm not even sure that creating different classes for different storage solutions is the right way to do it, although this is the way scipy implements the different sparse representations of a matrix. Perhaps you can think of a better way.

zerothi commented 2 years ago

Thanks for having this thought, I have been thinking about this for some time.
I agree we need some kind of BaseGrid.

pfebrer commented 2 years ago

This goes a bit against your https://github.com/zerothi/sisl/pull/496 in the sense that the naming for that class should be changed, but...

No problem, I'm rewriting it anyway to be more elegant (and efficient) and the name is no longer SparseGrid :)

I think we should probably not use any kind of limitation on ND grids in a new grid specification.

I also think this, if users have enough memory for their particular system and grid shape they should be able to have as many dimensions as they need.

It would be great if the BaseGrid mapped a dims specification onto the Geometry axes, i.e. if you want an 3D Grid with only a single axis, one could do Grid(geometry, dims="x", datadims=2) or possibly some other method.

I already mentioned this at some point, but do you think a RealSpace from which BaseGrid inherits could make sense? A 1D grid following some lattice direction is great, but also a 1D grid along any arbitrary vector. Even perhaps points along a path that is not necessarily a line.

Along that line, I was thinking that you could also have a uniformly spaced grid where the dimensions are spherical or cilindrical coordinates, not necessarily cartesian.

Perhaps this is an ideal case for starting with xarray.Dataset, perhaps it would make sense now? I am now ready to make it a hard dependency ;) if so, should we then subclass Dataset, in some sense that would be easiest since things would be done in-place and we wouldn't need to add new methods for everything that we want to use directly... But I don't know if the Dataset documentation would remove the clarify of the sisl specific handling?

I'm not sure what is the best way of using Dataset. See here: https://docs.xarray.dev/en/stable/internals/extending-xarray.html. They suggest that you use composition instead of inheritance. They also have this "accessor" thing that they describe in the link, where you could register a class that implements sisl-specific manipulations.

This is in some sense great because it makes it more functional in the sense that people can use specific sisl functions without using the Grid interface, but also perhaps a bit of a pain because then you don't have a class for Grid. You use Dataset objects and implement functions that manipulate them, so you need to store metadata on the Dataset, which might get ugly to manage (I'm not sure though?). Perhaps the optimal thing would be to have both accessors and a sisl class storing a Dataset, then the functionality can be used in both ways. But I'm not sure of the full implications of this.

could the Dataset also be used as a sparse index handler? That would make it easier to have a common interface...

Yes, I think that could make a lot of sense. Actually for storing a "normal" dense array all you need is a DataArray, not a Dataset, because you are just storing a variable (the value) that has multiple dimensions. But to store a sparse structure it makes sense to have a Dataset, since you have to store different variables (which also have different dimensions). E.g. in a CSR matrix you need to store the ptr variable which has dimensions (row, ) , the col variable which has dimensions (i_value, ) and the data variable which has dimensions (i_value, ...). It is also possible to define relations between dimensions, although I have never done it. E.g. in this case the row and i_value dimensions are dependent on each other, and defining the relation in xarray might allow for some very nice processing capabilities with the high level APIs.

zerothi commented 2 years ago

This goes a bit against your #496 in the sense that the naming for that class should be changed, but...

No problem, I'm rewriting it anyway to be more elegant (and efficient) and the name is no longer SparseGrid :)

I think we should probably not use any kind of limitation on ND grids in a new grid specification.

I also think this, if users have enough memory for their particular system and grid shape they should be able to have as many dimensions as they need.

It would be great if the BaseGrid mapped a dims specification onto the Geometry axes, i.e. if you want an 3D Grid with only a single axis, one could do Grid(geometry, dims="x", datadims=2) or possibly some other method.

I already mentioned this at some point, but do you think a RealSpace from which BaseGrid inherits could make sense? A 1D grid following some lattice direction is great, but also a 1D grid along any arbitrary vector. Even perhaps points along a path that is not necessarily a line.

I see, so you want some kind of new class that parametrizes a line, and use that as the lattice vectors for a grid. I presume this would mean that lots of checks needs to be in place to ensure that, say, density will not be mapped onto a parametrized vector, but along straight vectors... Hmm... I see where you are going, and I kind of like the idea, but I don't know about the complexity here?

Along that line, I was thinking that you could also have a uniformly spaced grid where the dimensions are spherical or cilindrical coordinates, not necessarily cartesian.

That would also be cool.

Perhaps this is an ideal case for starting with xarray.Dataset, perhaps it would make sense now? I am now ready to make it a hard dependency ;) if so, should we then subclass Dataset, in some sense that would be easiest since things would be done in-place and we wouldn't need to add new methods for everything that we want to use directly... But I don't know if the Dataset documentation would remove the clarify of the sisl specific handling?

I'm not sure what is the best way of using Dataset. See here: https://docs.xarray.dev/en/stable/internals/extending-xarray.html. They suggest that you use composition instead of inheritance. They also have this "accessor" thing that they describe in the link, where you could register a class that implements sisl-specific manipulations.

Thanks for the link, that seems like the way to go.

This is in some sense great because it makes it more functional in the sense that people can use specific sisl functions without using the Grid interface, but also perhaps a bit of a pain because then you don't have a class for Grid. You use Dataset objects and implement functions that manipulate them, so you need to store metadata on the Dataset, which might get ugly to manage (I'm not sure though?). Perhaps the optimal thing would be to have both accessors and a sisl class storing a Dataset, then the functionality can be used in both ways. But I'm not sure of the full implications of this.

could the Dataset also be used as a sparse index handler? That would make it easier to have a common interface...

Yes, I think that could make a lot of sense. Actually for storing a "normal" dense array all you need is a DataArray, not a Dataset, because you are just storing a variable (the value) that has multiple dimensions. But to store a sparse structure it makes sense to have a Dataset, since you have to store different variables (which also have different dimensions). E.g. in a CSR matrix you need to store the ptr variable which has dimensions (row, ) , the col variable which has dimensions (i_value, ) and the data variable which has dimensions (i_value, ...). It is also possible to define relations between dimensions, although I have never done it. E.g. in this case the row and i_value dimensions are dependent on each other, and defining the relation in xarray might allow for some very nice processing capabilities with the high level APIs.

I agree that this could also work. But probably not with a CSR matrix, I think it needs to be a COO since linked dimensions results in square value data, and this forces that each orbital has the same number of connections, which is rarely the case. So I think we need to go with COO... Unless there is something I am missing.

pfebrer commented 2 years ago

I see, so you want some kind of new class that parametrizes a line, and use that as the lattice vectors for a grid. I presume this would mean that lots of checks needs to be in place to ensure that, say, density will not be mapped onto a parametrized vector, but along straight vectors... Hmm... I see where you are going, and I kind of like the idea, but I don't know about the complexity here?

I hadn't thought about it this way, but the way you phrased it with "parametrized line" sounds great :)

I think here https://github.com/zerothi/sisl/pull/496 might simplify it a lot. The only extra thing that you need to implement to make a parametrized line work is the retrieval of indices where the orbital has a nonzero value. A function that does this could be expected as a method of the Grid class. Then there is complete freedom to define it in your particular class. The naive (base) implementation would be just checking all points of the grid. This is unfeasible for 3D grids, but for a 1D parametrized line is perfectly doable I would say.

pfebrer commented 2 years ago

I say it may simplify it a lot because in that implementation (specially in the new rewrite) there is no concept of dimensions. The indices are raveled to be a single integer and all the machinery works with that. There is no need to know anything about where each index is in real space once you have the orbital values for each index.