ziotom78 / Healpix.jl

Healpix library written in Julia
GNU General Public License v2.0

Support for multiple-threads ? #101

Closed arthurmloureiro closed 1 year ago

arthurmloureiro commented 1 year ago

Hi @ziotom78 !

Excellent package! I was testing Healpix.jl against healpy in Python in terms of speed, and I noticed that the Python wrapper of Healpix is a few times faster than the Julia module, due to multithreading in Python.

> @btime synfast(cls, 4096, 4096)

resulting in 18.015 s (12 allocations: 1.63 GiB).

While in python:

%%time map_test = hp.synfast(cls, 4096, 4096)

resulting in Wall time: 4.64 s.

Are there plans for Healpix.jl to support multithreading?

Thanks!

ziotom78 commented 1 year ago

Hi @arthurmloureiro , it's a complex story.

Spherical transforms are not implemented directly in Healpix.jl. Instead, it implements functions like synfast by importing a few functions from Libsharp.jl, a Julia wrapper around libsharp2, a C library by @mreineck. Unfortunately, libsharp2 is no longer maintained, because its author has moved on to a new library, ducc. I am in contact with @mreineck, who is developing a Julia wrapper that I plan to use in Healpix.jl instead of Libsharp.jl. However, it is not clear how long this will take.

A quick way to get multi-threading into Libsharp.jl (and thus into Healpix.jl) would be to add compilation flags like -fopenmp when building the binary for Libsharp_jll, the package created with BinaryBuilder.jl that Libsharp.jl uses under the hood.
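As a sketch of what such a change might look like (the variable names follow common BinaryBuilder conventions; this is an assumption, not the actual Libsharp_jll recipe):

```shell
# Hypothetical excerpt of a BinaryBuilder build script: export -fopenmp so the
# subsequent configure step builds libsharp with OpenMP enabled.
export CFLAGS="-fopenmp"
echo "configuring with CFLAGS=$CFLAGS"
# ./configure --prefix=$prefix && make -j$nproc && make install
```

The real recipe lives in Yggdrasil's build_tarballs.jl for Libsharp_jll; only the flag-export idea is illustrated here.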

I would gladly review a PR for Libsharp_jll, so if you feel inclined, please give it a try! You might take some inspiration from a previous PR by @xzackli, where he added a few compiler flags to Libsharp_jll that brought a measurable increase in performance: https://github.com/JuliaPackaging/Yggdrasil/pull/4501.

P.S. At the moment @LeeoBianchi is working on making Libsharp.jl support MPI. His work will however probably end up in a new library, possibly Libsharp_MPI.jl; unlike Libsharp.jl, this new library will probably not support Windows.

arthurmloureiro commented 1 year ago

Hi @ziotom78 ,

Thank you so much for your quick response!

I can definitely try to include the -fopenmp flag in Libsharp_jll and tag you as a reviewer :)

I am quite new to Julia, so it may take a while for me to figure out how to do this properly without breaking anything, but the PR you linked from @xzackli definitely helps! I will keep you posted.

LeeoBianchi commented 1 year ago

Hi everyone,

I confirm that an MPI-parallel version of Healpix.jl should be ready soon.

That being said, @arthurmloureiro, if you are looking for shared-memory multi-threading, it already comes "out of the box" with the current version of Libsharp.jl through OpenMP (no compilation keyword needed): just "set the environmental variable OMP_NUM_THREADS to control the number of threads used", as written in the README of the Libsharp.jl repo.

Healpix.jl naturally inherits the OpenMP parallelization.

Cheers, Leo

arthurmloureiro commented 1 year ago

Hi @LeeoBianchi ,

Not sure if it's because I am testing the code on an Apple M1 Pro, but OMP_NUM_THREADS doesn't seem to affect libsharp when I call it from Julia. I benchmarked against healpy while setting OMP_NUM_THREADS to different values, and I always get the same timings. I could be doing something wrong...

I am setting it with export OMP_NUM_THREADS=8 before calling my heapix_benchmark.jl script. I also tried adding ENV["OMP_NUM_THREADS"] = 8 inside the script itself. For lmax = nside = 4096, I always get the 17-18 s I reported. Does this make sense?

Excited to see how the MPI-parallel version will work! That would be an immense advantage over the Python and C++ implementations of libsharp!

LeeoBianchi commented 1 year ago

@ziotom78 and I already went through this conversation, and the conclusion was that it is tricky to set the environment variable OMP_NUM_THREADS from within a Julia script. To see what is actually going on, I suggest keeping an htop session running in the background while testing Healpix.jl's SHTs.

I remember struggling to change it too when I tried, but on my current laptop (Linux Pop!_OS) OMP_NUM_THREADS must already be set to the maximum number of cores (12) by default. In fact, if I run an SHT transform that lasts at least a couple of seconds, I clearly see the number of threads in htop jump from 1 to 12.
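For the record, a minimal sketch of a launch sequence that sidesteps the problem (the script name is a placeholder, not the actual benchmark script from this thread):

```shell
# OMP_NUM_THREADS is typically read when the OpenMP runtime initializes, so it
# must already be in the environment when the Julia process starts; setting
# ENV["OMP_NUM_THREADS"] from inside an already-running session can be too late.
export OMP_NUM_THREADS=8
echo "OpenMP threads requested: $OMP_NUM_THREADS"
# julia benchmark.jl   # placeholder for the actual benchmark script
```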

arthurmloureiro commented 1 year ago

So it's probably that! Using htop, when I run a healpy SHT I can see all my cores at work. At the moment the same doesn't happen with Healpix.jl, but I am re-installing OpenMP and other dependencies to check whether that changes :)

ziotom78 commented 1 year ago

I just re-checked build_tarballs.jl and saw that there is no explicit -fopenmp flag used while building libsharp.

Might it be that on Apple M1 architectures the ./configure script does not enable OpenMP? This might explain why @LeeoBianchi got it enabled while @arthurmloureiro did not.

@arthurmloureiro, a simple way to test this is to try downloading and compiling libsharp2 directly:

git clone https://gitlab.mpcdf.mpg.de/mtr/libsharp.git
cd libsharp
mkdir build
cd build
../configure

and check what the script prints on your computer. On mine, it prints that -fopenmp is added to the list of compiler switches.

giordano commented 1 year ago

macOS doesn't come with OpenMP libraries, you need to install it separately and make the compiler find it. We have to do the same in Yggdrasil, see for example https://github.com/JuliaPackaging/Yggdrasil/blob/ddee828cfc46f274120e6ffd7ced9fccd86921c0/B/blis/build_tarballs.jl#L135-L138

LeeoBianchi commented 1 year ago

> macOS doesn't come with OpenMP libraries.

Hmm that's why!!

arthurmloureiro commented 1 year ago

I see! I assumed that because Python could run things with OpenMP, Julia would be able to find it too. I do have OpenMP installed now, but rebuilding the Julia packages didn't work.

I am trying the route of installing libsharp2 directly, but that leads me down a spiralling path of countless errors. I need to check again how we packaged libsharp in another code that I know installs perfectly well, with OpenMP, on M1 Macs.

giordano commented 1 year ago

> I see! I assumed that because Python could run things with OpenMP, Julia would be able to find it too.

It depends on which library your Python package links against: it probably uses a build that was compiled against an external OpenMP and makes sure to ship it. That is basically what https://github.com/JuliaPackaging/Yggdrasil/pull/6029 does for the build used by the Julia package.

mreineck commented 1 year ago

Hi,

concerning multi-threaded SHTs in Julia, the upcoming ducc package should indeed be much easier to handle than libsharp2. It is no longer OpenMP-based, but rather uses the standard threading functionality that comes with modern C++, so you no longer need any additional components besides a C++ compiler to get things going. It will also be possible to specify the desired number of threads as a function argument, so there is no more need to set environment variables.

The main hurdles for the communication between Julia and C++ seem to be addressed now; I can make use of anything resembling a Julia StridedArray and access it from C++ without copying. A demo of the interface (not yet for SHTs, but for non-uniform FFTs) can be seen at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/julia/ducc0.jl. If you have a look, and you notice anything bad, please let me know! I'm not a Julia programmer at all, and still learning the basics.

@LeeoBianchi, if your MPI work is already online, I'd be very interested to have a look!

Cheers, Martin

mreineck commented 1 year ago

A small word of caution: if possible, only use MPI-parallel SHTs on multi-node computers, with one MPI task per node and multithreading within each node. Doing MPI SHTs with one MPI task per core can be pretty inefficient.

mreineck commented 1 year ago

@LeeoBianchi, I can entirely understand your trouble accessing the internal details of libsharp's MPI datatypes; they are quite low-level, and I was never really happy with them. That's also the reason why I decided to remove MPI functionality from the later libraries entirely ... not because I think MPI shouldn't be used, but because I think it should be done at a higher level, not inside the SHT itself.

In principle, an alm2map transform with MPI splits into a few steps: a Legendre transform over each task's local m values (alm2leg), a redistribution of the resulting coefficients from an m-based to a ring-based distribution, and FFTs over each task's local rings (leg2map).

The ducc0 package provides Python bindings for alm2leg and leg2map (and their adjoints), so the nasty numerical work is still done in a compiled language, but the MPI work is left entirely to a higher-level library or to the user, who can do it exactly the way they like.
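Reading this description, the split might be sketched like so (pseudocode; alm2leg and leg2map are the ducc0 building blocks mentioned above, while mpi_transpose stands in for whatever redistribution the user supplies):

```
leg_m   = alm2leg(alm_local)     # each task: Legendre transform over its local m values
leg_th  = mpi_transpose(leg_m)   # redistribute: m-distributed -> ring-distributed
map_loc = leg2map(leg_th)        # each task: FFTs over its local rings
```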

I think I can provide a Julia interface for this functionality in, maybe, two weeks. Do you think that would be a good way forward? The interface is still pretty complicated, but this is unfortunately necessary, since I want to support the largest possible class of spherical pixelizations, not just Healpix.

arthurmloureiro commented 1 year ago

I have just tested the OpenMP implementation again with Libsharp.jl on the M1 Mac, and I can tell you that it now works!

The merge from https://github.com/JuliaPackaging/Yggdrasil/pull/6029 seems to have worked perfectly fine!

(screenshot of benchmark output, 2022-12-28:)
[ Info: Using 10 thread(s)!
[ Info: Running synfast...
  2.372 s (73 allocations: 1.63 GiB)

Thank you so much @giordano @ziotom78 @LeeoBianchi and @xzackli !