zerothi / sisl

Electronic structure Python package for post analysis and large scale tight-binding DFT/NEGF calculations
https://zerothi.github.io/sisl
Mozilla Public License 2.0

fixed lots of parallel calculations #790

Closed · zerothi closed this 2 weeks ago

zerothi commented 2 weeks ago

A lot of refactoring of the parallel code enables huge speedups.

The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

SISL_PAR_CHUNKSIZE=25

It specifies the default chunksize for the pool.*() methods. This generally gives close to perfect scaling, and fine-tuning it can help further.

I can now see huge parallel performance benefits by leveraging the chunksize variable.

The dispatcher for the bz.apply method now allows finer tuning of the pool creation.

  1. Single argument

    bz.apply(pool=2)

    Will create a pool with 2 processors

  2. Two tuple

    bz.apply(pool=(2, {"chunksize": 200}))

    will create a pool of 2 processors, and the chunksize will be 200 (regardless of SISL_PAR_CHUNKSIZE)

  3. Three tuple

    bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))

    will create a pool of 2 processors like so:

    pathos.pools.ProcessPool(nodes=2, args=1)

    Check the pathos documentation to see what can be done. Generally this should not be needed.

These 3 variants are currently added, but I would like some input to see whether it makes sense, or whether we should change arguments etc.
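
For illustration, here is a minimal sketch of the three call forms side by side. This is only a sketch: it assumes a Hamiltonian read from an fdf file and uses a Monkhorst-Pack grid with the trailing .average.eigh() purely as an example of a quantity to compute.

import sisl

H = sisl.get_sile("my_run.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [10, 10, 1])

# 1. single argument: 2 processes, chunksize taken from SISL_PAR_CHUNKSIZE
eigs = bz.apply(pool=2).average.eigh()

# 2. two-tuple: 2 processes with an explicit chunksize
eigs = bz.apply(pool=(2, {"chunksize": 200})).average.eigh()

# 3. three-tuple: extra keyword arguments forwarded to the pathos pool constructor
eigs = bz.apply(pool=(2, {"args": 1}, {"chunksize": 200})).average.eigh()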

The default number of processors is still 1, because OMP_NUM_THREADS can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES. So to be on the safe side, we default it to 1.

The simplest way to control things is to do this in the code:

bz.apply(pool=True)...

then invoke the script with:

SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py

The number of processes is then determined from the environment variable. In addition, all parallel invocations now correctly update the tqdm progress bars.

Since many routines might loop over distribution functions, we have changed them to enable broadcasting (internally). This required bumping the required numpy version to >=1.20, which was released in January 2021.

Many dot calls have been changed to @; numpy recommends matmul when that is the intention. The primary reason is that scipy.sparse.dot(np.ndarray) can in certain cases result in an np.ndarray with dtype=object where the elements are themselves sparse matrices. So we can't use that.

There are still some corner cases where @ cannot be used, e.g. 1 @ array will fail since @ is not defined for scalars. This is a bit unfortunate as it would ease things a bit.
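
As a small standalone illustration of the recommended pattern (not sisl code, just a sketch):

import numpy as np
import scipy.sparse as sp

S = sp.random(5, 5, density=0.4, format="csr")
v = np.random.rand(5, 3)

y = S @ v            # matmul with a sparse operand gives a dense result, as expected
# mixing .dot / np.dot with sparse operands is what can, in corner cases,
# produce dtype=object arrays, hence the switch to @ throughout sisl

# the scalar corner case mentioned above:
# 1 @ v  ->  TypeError, @ is not defined for scalars
z = 1 * v            # plain multiplication still works for scalars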

Added more typing in state.py, electron.py and a few other places.

zerothi commented 2 weeks ago

@pfebrer @zerothi @nils-wittemeier

I think I have finally found the root cause of bad parallel performance. I can get massive speedups with very little effort.

Please try and read the above, and if you have some complex workflows, feel free to give them a try. If this makes sense and you can reproduce it, I think we need to add some kind of section on parallel performance to teach users how to get massive speedups.

pfebrer commented 2 weeks ago

I have checked that this works for pdos plots and bands plots (without any changes) :tada:

You can test it with a script like:

parallel_plots.py

import sisl

H = sisl.get_sile("my_run.fdf").read_hamiltonian()

H.plot.pdos(kgrid=[100, 1, 1]).show()

One process:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=1 time python parallel_plots.py
108.89user 0.46system 1:49.47elapsed 99%CPU

Two processes:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=1 time python parallel_plots.py
124.71user 1.29system 1:04.45elapsed 195%CPU

Nice!

pfebrer commented 2 weeks ago

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix :thinking:

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1], if the user sets SISL_NUM_PROCS=2 they would expect some acceleration without needing to tweak anything else. But if chunksize is 25 they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

pfebrer commented 2 weeks ago

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better in a computer that is not my laptop :thinking:

zerothi commented 2 weeks ago

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix 🤔

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1], if the user sets SISL_NUM_PROCS=2 they would expect some acceleration without needing to tweak anything else. But if chunksize is 25 they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

Agreed, I thought about this initially, my first idea was that CHUNKSIZE could be a fractional number, in which case it was some fraction of the iterations / NPROCS. But then the environ class would have to handle multiple value types, I guess this is ok...

I will play with that.
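
A rough sketch of how that fractional interpretation could look (purely illustrative, the helper name is hypothetical):

import math

def resolve_chunksize(chunksize, n_iter, n_procs):
    # a value below 1 is read as a fraction of the iterations per process,
    # anything else as an explicit chunk size
    if 0 < chunksize < 1:
        return max(1, math.ceil(chunksize * n_iter / n_procs))
    return int(chunksize)

# e.g. 9 k-points on 2 processes with SISL_PAR_CHUNKSIZE=0.5 -> chunks of 3
print(resolve_chunksize(0.5, 9, 2))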

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better in a computer that is not my laptop 🤔

Yes, hybrid works just fine. For larger systems you'll see it even better. However, the thread placement might not be optimal, and hence it takes some cycles before it finds the best spot...

I'll make parallel the default, if SISL_NUM_PROCS > 1.

pfebrer commented 2 weeks ago

Although in my case 2 processes don't seem to be much better than 2 threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

zerothi commented 2 weeks ago

Although in my case 2 processes don't seem to be much better than 2 threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

The threading is used for BLAS, so you'll only see this for matrices of some size, probably above 250 (the bigger the better). You can try to tile the system so the matrix dimension is ~1000, then you should be able to see it more clearly. I don't know exactly how the thread placement influences things here; it might be that the threads are really battling because the processes are forked (not spawned), so you might get different results if you use something like this:

import multiprocessing as mp
# start workers with "spawn" instead of the default "fork" to avoid thread clashes
mp.set_start_method("spawn")

<rest of script>
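
Regarding the matrix size, a hypothetical sketch of tiling so the matrix dimension becomes large enough for BLAS threading to matter (the repetition counts are arbitrary, H is the Hamiltonian from the script above):

H_big = H.tile(4, axis=0).tile(4, axis=1)
print(H_big.no)   # number of orbitals, i.e. the dimension of the matrix to diagonalize
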
pfebrer commented 2 weeks ago

Agreed, I thought about this initially, my first idea was that CHUNKSIZE could be a fractional number, in which case it was some fraction of the iterations / NPROCS.

I think it's fine that, if it is not set, a default is computed (e.g. like you are proposing), and if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

pfebrer commented 2 weeks ago

Do you know if pathos is able to distribute work across computing nodes? :thinking: I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

zerothi commented 2 weeks ago

Agreed, I thought about this initially, my first idea was that CHUNKSIZE could be a fractional number, in which case it was some fraction of the iterations / NPROCS.

I think it's fine that, if it is not set, a default is computed (e.g. like you are proposing), and if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

On the other hand, you might do a convergence test, in which case the fraction might be a good idea? I'll add it, we can always adjust the default behaviour.

zerothi commented 2 weeks ago

Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

There is, but I don't want to complicate things here... ;) It should be much simpler to do your own mpi4py wrapper which does this ;)
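
For reference, a bare-bones sketch of what such an mpi4py wrapper could look like (the file name and the use of eigh are just placeholders, not a proposed sisl API):

import sisl
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

H = sisl.get_sile("my_run.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [100, 1, 1])

# each rank diagonalizes its own slice of the k-points
local_eigs = [H.eigh(k=k) for k in bz.k[rank::size]]

# collect everything on rank 0 for post-processing
all_eigs = comm.gather(local_eigs, root=0)
if rank == 0:
    print(f"gathered eigenvalues from {size} ranks")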

pfebrer commented 2 weeks ago

There is, but I don't want to complicate things here... ;) It should be much simpler to do your own mpi4py wrapper which does this ;)

Ok, maybe this could be a util in the toolbox. Something like:

mpirun -n 20 sisl_toolbox pdos/bands RUN.fdf

And that could generate a .PDOS or .bands file.

pfebrer commented 2 weeks ago

Another thing that would speed up calculations significantly (more than this) would be to merge #496 🙄

It gives a 100x speed-up to the density computation; for me it is the difference between being able to use sisl to compute thousands of densities or not, and the change to the public API is extremely minimal.

rreho commented 1 week ago

Hi, related to this conversation on parallelization, I have a few questions about the class sisl.viz.FatBandsPlot (I am using the latest version of sisl: the master branch). Here is a snapshot of the Python script:

fatbands = band_struct.plot.fatbands(
    bands_style={"color": "grey", "opacity": 0.2, "width": 2},
    Erange=[Emin, Emax],
    fatbands_scale=0.2,
    fatbands_mode="area_line",
    bands_mode="scatter",
)
fatbands.update_inputs(
    groups=[
        {"atoms": [0, 1], "color": "blue", "name": "1"},
        {"atoms": [28, 29], "color": "red", "name": "15"},
    ],
)

where band_struct is a `sisl.BandStructure` object.

Questions:

  1. Considering the high number of k-points I need for this study, is it possible to improve/enable the parallelization? If so, how? [I have 936 orbitals and it takes several hours]

  2. Even with only a few k-points the method seems rather slow, especially compared to similar routines that do not use the sisl.viz.FatBands methods. Is it possible to speed up the part related to the computation of the projections?

  3. I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

Thank you for your help.

zerothi commented 1 week ago

Let me note here that it depends on your OS etc.

If you are using Linux and you have a script, then you should do something like this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
python3 script.py

and it should do it automatically. If you are doing this from a Jupyter notebook, then I think you should do this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
jupyter notebook notebook.ipynb

Note that 4 should be changed to whatever you want. As for question 3, you should probably open a new issue for that one.

rreho commented 1 week ago

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

zerothi commented 1 week ago

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

In main this is the (new) default due to this PR, so you shouldn't need it. I would suggest playing with this locally first to see how it performs. :)

pfebrer commented 1 week ago

Is it possible to speed up the part related to the computation of the projections?

The fatbands plot just computes the eigenstates in the way that sisl allows. So there would not be speed benefits from doing it yourself "manually" within sisl.

I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

You can always compute the fatbands yourself using sisl's methods, or take the data from the plot like:

fatbands.nodes["bands_data"].get()

Then you can do whatever you want with the data. But if you have seen issues with the plotting, it would be great if you could create a minimal example of the error and open an issue :) Then we can all benefit from the fixes!

rreho commented 1 week ago

Thanks for your comments. I created a new issue with a minimal example.