wlav / cppyy

using a cppyy-based python module to pull in highway SIMD code #63

Open · kfjahnke opened this issue 2 years ago

kfjahnke commented 2 years ago

I have posted a lengthy post on highway's issue tracker concerning the use of my cppyy-based b-spline module, modified to use highway instead of Vc for its SIMD code. I'm posting here because I have an issue with the ISA information which cppyy communicates to the C++ code it consumes - it seems sufficient for Vc, but not for highway. highway also needs additional compiler flags when doing straight C++ compiles - simply passing -mavx2 is enough for Vc, but not for highway. I suspect that cppyy uses something like -march=native to exploit the current CPU's 'best' ISA, and that the ISA information is made available to the JITed code... somehow. Maybe you could shed some light on this?

wlav commented 2 years ago

cppyy checks the CPU and adds -mavx to the Cling CLI arguments if the processor supports it. You can override this behavior by setting the envar EXTRA_CLING_ARGS (which defaults to -O2 -mavx if supported) to something else.

However, there are a few catches here:

1. Most likely the PCH has to be rebuilt. You can either wipe the current one or set CLING_REBUILD_PCH=1 one time to force it. The reason is that many flags additionally enable preprocessor macros, which get baked into the PCH.
2. Clang in many cases has functions for vector methods where gcc has intrinsics. These live in Clang-specific headers that ship with Cling. Since we're currently on Clang9 (the Clang13 port is in its final stages, but note that the final steps are always the hardest), it will not have support for the latest and greatest architectures.
3. Cling does not enable all of Clang's optimization passes, as a trade-off between interactivity and performance. That's actually something that should be revisited when moving to Clang13, as it hasn't been re-evaluated in a long time.

Ideally, Cling would offer (one-off) sub-interpreters where, interactively, compiler flags could be passed in. Upstream is getting more open to the idea, in particular b/c of CUDA. Right now, CUDA support is done by having 2 interpreters sit side by side (cppyy will create and manage 2 different PCHs), which is functional but clunky. (Then again, everything about CUDA is "functional but clunky." :) ) Additionally, I started working on Numba integration and it would be very useful to control certain features (e.g. OpenMP) per Numba trace, so I'll continue to badger them about it.

kfjahnke commented 2 years ago

Thank you! I rebuilt the PCH and passed the extra compiler arguments which highway needs to activate the AVX2 ISA via EXTRA_CLING_ARGS (-mavx2 -march=haswell -maes -mpclmul), and everything ran smoothly - both when JITing the code with cppyy 'all the way down' and when linking in a precompiled binary. AVX2 is what I have here; anything 'better' I can't test, lacking the hardware.
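For reference, the override looks roughly like this from the Python side - a minimal sketch that assumes the flag set above and that the environment variables are set before cppyy is first imported (CLING_REBUILD_PCH is the one-time PCH rebuild switch mentioned above):

```python
import os

# Must happen before the first 'import cppyy' - cppyy reads these at startup.
os.environ["EXTRA_CLING_ARGS"] = "-O2 -mavx2 -march=haswell -maes -mpclmul"
os.environ["CLING_REBUILD_PCH"] = "1"   # one time only: force the PCH rebuild

import cppyy
print(cppyy.__version__)
```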

> EXTRA_CLING_ARGS (which defaults to -O2 -mavx if supported)

-mavx? That's a bit of a niche ISA, with the unlucky decision to give float and integer vector operations different widths - which was fixed in AVX2, where both are the same. What surprises me is that you would use a fixed set of flags by default. Wouldn't it be better to 'look at' the CPU the process is running on and figure out what it can provide? After all, anything cppyy builds up is for the current CPU only.

> (Then again, everything about CUDA is "functional but clunky." :) )

That actually made me laugh :D It's one of the reasons why I decided to use SIMD on a good old CPU for rendering.

wlav commented 2 years ago

> Wouldn't it be better to 'look at' the CPU the process is running on and figure out what it can provide?

That's what I meant by "if supported." :) Yes, it checks the cpuinfo flags.

And yes, I could also preferentially check for AVX2. In fact, since PCHs are now built explicitly for specific feature sets (this was done for CUDA), that can be done transparently.
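A rough sketch of that kind of check (Linux-only, reading /proc/cpuinfo; not the actual cppyy code) might look like this:

```python
def best_simd_flag(cpuinfo_path="/proc/cpuinfo"):
    """Pick a -m flag based on the CPU's reported feature flags (Linux only)."""
    with open(cpuinfo_path) as f:
        text = f.read()
    flags = set()
    for line in text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
    if "avx2" in flags:
        return "-mavx2"
    if "avx" in flags:
        return "-mavx"
    return ""   # no AVX support reported; leave the default flags alone

# e.g. prepend the result to EXTRA_CLING_ARGS before importing cppyy
```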

As an aside, upstream disabled all of this when they forked and backported cppyy: apparently there are folks distributing binaries to clusters of machines with different capabilities, so they just don't want to deal with any of it.

kfjahnke commented 2 years ago

> Yes, it checks the cpuinfo flags.

If it looks at the cpuinfo flags, couldn't it then communicate these flags to the code it wraps? The problem I have here is that if I don't explicitly pass the ISA-specific flags via EXTRA_CLING_ARGS, the highway code further down the line only produces SSSE3 code - even on my AVX2 machine. And if I don't pass anything, the Vc code picks up the AVX flag you set by default - AVX is supported, but the CPU could do better.

I don't know if this is technically feasible, but I think it would make sense to try and be as close to the actual target as possible, using -march=native throughout. I just tried using that flag via EXTRA_CLING_ARGS and made my precompiled binaries with it as well, and both came out compatible, so with -march=native all flags which highway needs seem to be passed just right - and at the same time, this should work on all targets, helping portability. I'll recommend that for my module.
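A quick way to check whether the flags actually reached the JIT is to ask it which ISA macros the compiler defines - a small sketch, assuming an x86 target (the macro names are the standard compiler-defined ones, the function name is made up here):

```python
import cppyy

cppyy.cppdef(r"""
const char* jit_isa() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSSE3__)
    return "SSSE3";
#else
    return "baseline";
#endif
}
""")

# should report AVX2 (or better) once -march=native / -mavx2 took effect
print(cppyy.gbl.jit_isa())
```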

> distributing binaries to clusters of machines with different capabilities

It may also become an issue with CPUs which have different types of cores (like E- and P-cores).

wlav commented 2 years ago

Yes, -march=native does the trick AFAICT; I'll replace the CPU feature check with it and see whether the math folks notice/complain. :)

As a complete aside: in the other thread you mentioned atan2. Do you have a few more details? It'd be very useful to me to have a faster atan2 accessed through the JIT for demonstration purposes.

As it happens, I was playing with atan2 last week, trying to see whether I could beat libm's version with JITed code using Numba+cppyy, but didn't manage (performance was the same for all versions/settings). My end goal in that project is much more complicated code (template expressions), but having a simple example to show that this approach is generally useful would be great.
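A minimal version of such a comparison might look like the sketch below - just a strawman with a hypothetical name (atan2_loop), not the actual test: a JITed scalar loop over std::atan2 timed against numpy.arctan2, relying on cppyy accepting numpy arrays where double* is expected.

```python
import time
import numpy as np
import cppyy

cppyy.cppdef(r"""
#include <cmath>
#include <cstddef>

void atan2_loop(double* out, const double* y, const double* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::atan2(y[i], x[i]);
}
""")

n = 10_000_000
y, x = np.random.rand(n), np.random.rand(n)
out = np.empty(n)

t0 = time.perf_counter()
cppyy.gbl.atan2_loop(out, y, x, n)      # JITed loop over libm's atan2
t1 = time.perf_counter()
ref = np.arctan2(y, x)                  # numpy baseline
t2 = time.perf_counter()

print(f"JITed loop: {t1 - t0:.3f} s   numpy.arctan2: {t2 - t1:.3f} s")
assert np.allclose(out, ref)
```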

Thanks!

kfjahnke commented 2 years ago

> -march=native does the trick AFAICT

Looks like a good solution to me, because it refers to the machine you're on, whatever that may be. And the compiler seems to set all flags accordingly - even highway, which is picky about the CPU flags and e.g. requires more than -mavx2 to emit AVX2 code, seems to get all the flags it needs.

> playing with atan2

The talk about atan2 is about a SIMD version. Vc provides one for i86, highway does not. I ported Vc's version so it can be used with highway - the code is here - but I don't think this is really anything novel; the interesting bit is using it in SIMD code. Anyway, have a look, maybe you find something inspiring. When it comes to SIMD code, having the above implementation makes a huge difference, because the std function is not autovectorized. I use it often in my rendering code, so it's a must-have for me.

wlav commented 2 years ago

Thanks! To use that, I'll need to be a few steps further along (both Numpy -> some array class w/o copy, and use of C++ objects in the Numba trace w/o extraneous boxing/unboxing). It'd be interesting to see whether that approach can outperform Numba's vectorize at that point (it should, b/c it can remove at least the indirection of the function call, which in Numba is implemented rather inefficiently to boot).

kfjahnke commented 2 years ago

If you're interested in SIMD, you may want to have another look at vspline. I decided recently to give highway a shot as a SIMD backend. This was the fourth SIMD backend for vspline - I already had my own 'goading' implementation using small loops and relying on autovectorization, the Vc implementation, and one based on std::simd. I realized that it might be a good idea to standardize my use of SIMD and isolate it into code with a common interface which, at the same time, would encapsulate the concrete implementation and hide it from the application. This process is now pretty much done: the 'goading' implementation serves as fallback (for fundamentals which can't be vectorized with a given SIMD library), and it's a matter of a single preprocessor #define which SIMD library is 'slotted in' to provide 'real' SIMD code - and if none is chosen, the entire code falls back to 'goading'.

This new mode of using SIMD is not yet in vspline 'proper' - I tend to implement new features in the vspline copy living in my image/panorama viewer, lux, because there I have an entire application running the code, exploiting pretty much all of its features - and because it's a visual application, if anything is amiss you usually spot it because the images come out wrong. The next thing I came up with is trying the new code with python-vspline, and... here we are again.

The new interface code is in *_simd_type.h in bspline/vspline. The simd_type variants each provide a 'vector-like' container type with arithmetic capability, plus the SIMD-typical mask and index types. And vspline has code to 'plough through' nD arrays of data, so that small chunks can be processed by multiple threads using SIMD pipelines on the data. It's very much like ufuncs, or what Numba does, but it's entirely in C++, and the iteration is not over individual array elements but over the simd_type objects, which the 'ploughing' code puts together, de/interleaving if necessary. The pipeline code is easy to write, so you're up and running quickly: all you need is to set up the target and optionally source arrays, then call vspline::transform with the pipeline functor, and that's it.

Some of the code might lend itself to use in your projects - I think the wrapper layer encapsulating the SIMD code is a neat idea, and my multithreading and pipeline-feeding code is very efficient and has been in use for years on various platforms, so I reckon it's stable. I should mention, though, that my code is only for 'xel' data: small homogeneous aggregates like pixels. It's not for structured data.

Why highway? I have high hopes: allegedly it can provide proper SIMD code for pretty much every CPU, not just i86. The next target I have in mind is ARM, especially the M1, which - up to now - I could only run with my 'goading' implementation: Vc does not support it, but highway does NEON (and even SVE). And highway is actively developed and gives good feedback. On the downside, it's not as generic as Vc and sometimes a bit close to the bare metal, so the interface is quite large, with a few special cases. Potentially, the code in hwy_simd_type.h should work on all platforms highway supports. The problem I have currently is that I don't have any of these machines, so I'm waiting around for other people to help out. I've already set things up so that a test build of lux on the M1 can be done with CMake by setting a few build options (a 'native' build, described in the docu), but currently my 'M1 colleague' is busy, and the FreeBSD port is struggling with attempts at getting the code to run on 32-bit ARM machines and hitting memory limits, so my next task will be refactoring...

kfjahnke commented 2 years ago

I played a bit with atan2 and have written a little test program which transforms 2D coordinates to polar (using atan2) and back. In single-precision float, and with the highway version (which needs a highway installation and use_hwy=True in config.py), I get processing times like 85 msec for cartesian to polar and 80 msec for polar to cartesian, for 100,000,000 coordinates. That's on a Haswell Core i5 with multithreading.
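For orientation, the core of what the test computes, written as a plain numpy baseline (not the vspline/highway code itself, and with a smaller array to keep memory modest):

```python
import numpy as np

# 10,000,000 coordinate pairs in single precision (the test above uses 100,000,000)
n = 10_000_000
x = np.random.uniform(-1.0, 1.0, n).astype(np.float32)
y = np.random.uniform(-1.0, 1.0, n).astype(np.float32)

# cartesian -> polar
r = np.hypot(x, y)
phi = np.arctan2(y, x)

# polar -> cartesian
x2 = r * np.cos(phi)
y2 = r * np.sin(phi)
```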

wlav commented 2 years ago

I'm finding the original to be faster than hwy most of the time, even though the latter is multithreaded and the former always single-threaded? Only once in a while does the latter run significantly (as in: 45%) faster, so something seems off somewhere. In all cases, though, the hwy timings show high deltas from run to run.

I tried a larger set of coordinates (both height and width x3), which stabilized the numbers from run to run, but it made no difference to the final conclusion.

(This on an AMD EPYC 7702P.)

kfjahnke commented 2 years ago

Thanks for testing this on your machine!

> I'm finding the original to be faster than hwy most of the time

I looked into it, and it turns out the program was ill-suited to determining arithmetic speed. What it measured instead (mostly) was the efficiency of the memory traffic: the program was largely memory-bound. And highway just came up with new de/interleaving code, which probably makes the memory access faster than what Vc does.

> something seems off somewhere.

I made a version which does the cartesian-to-polar conversion and back 128 times inside the C++ code, which reduces the memory traffic per conversion to 1/256. Now the figures come out quite different. Everything times more or less as expected, except the single-threaded code working with functors JITed by cppyy. Here, the Vc version takes five times as long as the multithreaded version, and the hwy version even thirty(!) times as long - my system has four cores, so I expect the single-threaded code to take about four times as long, which it does for the other variants. So something really seems to be off, especially with the hwy version. When multithreading is on, the hwy version clearly comes out the fastest of the lot.
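The idea is to keep the intermediate values in registers and touch memory only once per element; a scalar sketch of that kind of kernel (hypothetical name, not the actual vspline/highway functor) could look like this:

```python
import cppyy

cppyy.cppdef(r"""
#include <cmath>
#include <cstddef>

// Convert to polar and back 'reps' times per element; memory is read and
// written only once per element, so the loop is dominated by arithmetic.
void polar_roundtrip_grind(double* x, double* y, std::size_t n, int reps) {
    for (std::size_t i = 0; i < n; ++i) {
        double xi = x[i], yi = y[i];
        for (int k = 0; k < reps; ++k) {
            double r   = std::hypot(xi, yi);
            double phi = std::atan2(yi, xi);
            xi = r * std::cos(phi);
            yi = r * std::sin(phi);
        }
        x[i] = xi;
        y[i] = yi;
    }
}
""")
```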

kfjahnke commented 2 years ago

I pushed code to close in on the problem - see 'grind.py'. If you'd like to help figure this out, you can make the binary versions of the functors with the 'makefile' in the package root and try to reproduce my findings. Here's sample output:

$ python grind.py --use_hwy --crd_use_binary --singlethread
SIMD library: hwy   use binary: False   use bin. functors: True   multithread: False
grind functor: 17.431698 sec   grind function: 17.434918 sec

$ python grind.py --use_hwy --singlethread
SIMD library: hwy   use binary: False   use bin. functors: False   multithread: False
grind functor: 124.889461 sec   grind function: 124.887715 sec