Build OpenCL kernels requiring CL2.0 (needed for __generic args)

szymonlopaciuk commented 4 months ago

Description

It seems that OpenCL on CUDA is quite permissive, whereas on AMD machines, which more closely follow OpenCL spec, CL1.2 is taken as the default version even if the actual device supports higher ones.

The current version of Xtrack contains code incompatible with CL1.2. This is because in some instances local scope values are passed to __global parameters. CL2.0 introduces a default __generic parameter, which accepts either, in a manner similar to CUDA. In particular the function multipole_compute_dpx_dpy_single_particle receives arguments from either memory. Requiring CL2.0 is the easiest fix in this case: the alternatives are to explicitly make two versions of the function, or manually copy from the global memory to local.

I am preparing a PR for Xtrack that fixes issues encountered on AMD, this is a prerequisite for those changes.

Checklist

Mandatory:

[ ] I have added tests to cover my changes
[x] All the tests are passing, including my new ones
[x] I described my changes in this PR description

Optional:

[x] The code I wrote follows good style practices (see PEP 8 and PEP 20).
[x] I have updated the docs in relation to my changes, if applicable
[x] I have tested also GPU contexts

rdemaria commented 4 months ago

I would be careful with OpenCL 2.0, it is not supported well across drivers. 3.0 is better supported, but it does not include many 2.0 features. I could test on the Intel driver and PoCL in case...

szymonlopaciuk commented 4 months ago

Sorry for the late response, I noted it already yesterday. I can experiment with 3.0 too if you think relying on 2.0 is a problem, I had not known this. I had a quick look at the spec and CL3.0 also supports the __generic address space, and it seems that it's all that I/we need.

We could leave the version configurable to the user, though I'm not sure of the benefit of that if our code does not compile on different versions (like currently, under 2.1)...

Do you support the idea of bumping the version in principle? The alternative is working around it, and it will be a bit annoying, though not impossible.

szymonlopaciuk commented 4 months ago

So:

Radeon VII supports CL2.0 but not CL3.0
Neither the T4s nor the Titan Vs support CL2.0 but strangely happily compile with the --std-cl=CL2.0 flag.
POCL is unhappy with 2.0 on the OpenStack Xeon machine. (Interestingly it compiles on a Threadripper, though I suspect that pyopencl is simply picking the wrong device on the heterogenous machine, it looks suspicious.)
Intel OCL Runtime seems happy with 2.0 on both the Threadripper and Xeon.

rdemaria commented 4 months ago

Does pocl choke on the C code or the flag? CPUs have the same memory space, so I expect the feature to be supported.

szymonlopaciuk commented 4 months ago

It chokes on the flag. I will have to check without, I wouldn't be surprised if it still has problems though: I'm seeing in clinfo that it supports CL1.2 exclusively.

szymonlopaciuk commented 4 months ago

Okay, so to complicate matters further, I have confirmed a bug in Xobjects (or more likely in pyopencl): no matter which platform is selected, the first one is actually chosen for running the kernel. It's very sneaky, because the compiler messages seem "okay" (see screenshot), but the actual kernel is not run where we need it. I'm not sure why at this point, I briefly looked at the code of pyopencl and nothing particularly bad jumps out.

I first observed strange behaviour with gpustat where the gpus would light up even if allegedly running on CPU, and confirmed by using vendor specific macros in the kernel code (__NV_CL_C_VERSION in this case for nvidia, which is the one producing the message in all the cases). See the screenshot, it's a bit ugly because I just put a printf in one of the tests, but still:

rdemaria commented 4 months ago

benchmarks on https://github.com/rdemaria/simpletrack works normally for me. I have upgraded last version everything. Interstingly I get:

POCL:  Device: cpu-skylake-avx512-AMD Ryzen 9 7950X3D 16-Core Processor
turns    npart t/turn[ms] t/turn/part[us]
   10    18000     154.68       8.59
   10    25000     254.29      10.17
NVIDIA: Device: NVIDIA TITAN V
turns    npart t/turn[ms] t/turn/part[us]
   10    18000      17.14       0.95
   10    25000      17.74       0.71
INTEL: Device: AMD Ryzen 9 7950X3D 16-Core Processor          
turns    npart t/turn[ms] t/turn/part[us]
   10    18000     415.34      23.07
   10    25000     543.04      21.72

xsuite / xobjects

Build OpenCL kernels requiring CL2.0 (needed for __generic args) #135

Description

Checklist