Detect AVX2 support at runtime

jakirkham commented 6 years ago

Currently users have to decide at compile time whether to build a binary that uses AVX2 instructions. If they build with AVX2 enabled and end up deploying to a machine whose CPU lacks AVX2, the process will crash on the illegal instruction. Users can build without AVX2 and the result will work regardless of whether the target infrastructure has AVX2 support, but the compression algorithms here may run slower than if they were built with AVX2 support. Admittedly, avoiding a crash is much more important than avoiding degraded performance.

However, in the ideal case, we could build numcodecs both with and without AVX2 support, detect at runtime whether AVX2 instructions are available, and choose the appropriate code path without crashing in either case. This will take a bit of work to understand where AVX2 instructions are being introduced and how to avoid them, though some of that was already done in the first referenced issue below.

xref: https://github.com/zarr-developers/zarr/issues/136
xref: https://github.com/zarr-developers/numcodecs/issues/24
xref: https://github.com/zarr-developers/numcodecs/pull/26
xref: https://github.com/zarr-developers/numcodecs/pull/27

alimanfoo commented 6 years ago

The only use of AVX2 intrinsics AFAIK is within c-blosc. @FrancescAlted could you confirm that c-blosc does not perform runtime dispatching based on hardware capabilities? If so, is this feasible?

jakirkham commented 6 years ago

When we last investigated this issue ( https://github.com/zarr-developers/zarr/issues/136 ), admittedly about a year ago, we narrowed it down to an AVX2 instruction, vinserti128, popping up in __pyx_pw_4zarr_5blosc_19compress ( https://github.com/zarr-developers/zarr/blob/v2.1.4/zarr/blosc.c#L2803 ), which was used by all compression code paths except Zlib ( https://github.com/zarr-developers/zarr/issues/136#issuecomment-283232434 ). @FrancescAlted had previously looked and found that there was no vinserti128 in Blosc ( https://github.com/zarr-developers/zarr/issues/136#issuecomment-283304842 ). This means it had to have been in the Zarr Cython-generated C code. We decided the solution was to allow one to disable AVX2 instructions at compile time. This works, but comes with the caveat that we cannot use AVX2 instructions at run time should they be available.

I have not investigated the analogous case since the Zarr/Numcodecs split, but I suspect the issue still exists. I can try to generate a new reproducer using newer versions of Zarr and Numcodecs, which should help us understand where this problem occurs now. Looking back at the C code now, I would suspect this line ( https://github.com/zarr-developers/zarr/blob/v2.1.4/zarr/blosc.c#L2812 ) to have caused the issue. Fixing this sort of issue may require some trickery on the building end of things.

alimanfoo commented 6 years ago

My apologies, I had forgotten this.


FrancescAlted commented 6 years ago

I confirm that C-Blosc does perform runtime dispatching based on hardware capabilities. To make it easier to assess which acceleration paths are available to Blosc, I have just implemented the possibility to print the CPU capabilities that will be used, via the BLOSC_PRINT_SHUFFLE_ACCEL environment variable ( https://github.com/Blosc/c-blosc/commit/dbf989d48cbe92ccbd201629985f684a59e704f3#diff-3b57192dd1ce214552c18801cfb7ae7bR33 ). And yes, it should be possible to activate the AVX2 path only on processors having this capability.

alimanfoo commented 6 years ago

Thanks Francesc.

I am way out of my depth here, but I don't believe there are any AVX2 intrinsic function calls in the Cython-generated C code, so if there are some AVX2 instructions in the compiled code I guess this must be an optimisation the compiler has figured out by itself. So if we want to compile c-blosc with the potential to use AVX2 when available, but also have the compiled code safe to run on hardware without AVX2, it sounds like we need to be able to tell the compiler something like "if you see an AVX2 intrinsic function call in the source code then go ahead and compile AVX2 instructions, but otherwise do not insert any AVX2 instructions by yourself". I wonder if this could be achieved with gcc via the -O flag, although I don't know if there would be other performance considerations.


FrancescAlted commented 6 years ago

Yeah, I am not an expert either. My hunch is that the -mavx2 flag should not be enforced in the compiler invocation when building the library; instead, the cmake machinery should be left to decide whether the compiler supports AVX2, so that it can generate the AVX2 code paths in the binaries. That possibly means that this is going to be difficult to achieve in environments that are not using cmake, but I may be wrong here.

At any rate, I am pinging the guy who did most of the SSE2/AVX2 runtime detection in Blosc some years ago. @juliantaylor any hints on this would be highly appreciated. Thanks in advance!

juliantaylor commented 6 years ago

Correct, -mavx2 allows the compiler to place avx2 code into whatever place it likes.

This piece of code looks like it compiles in avx2 unconditionally, though I am not familiar with this cython feature, it might just be an annotation not used during compilation: https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/vlen.c#L9

If the code that profits from avx2 is inside non-public cython code called from python, it should be pretty easy to compile it twice and wrap the appropriate call in python depending on the runtime environment. Code to determine cpu features at runtime can be found in e.g. blosc, or you can use compiler features (like __builtin_cpu_supports in gcc and newer clang versions). This also assumes cython cannot yet do automatic cpu-set-specific function cloning and dispatching by itself, like gcc (or icc) can.

detrout commented 5 years ago

I'm experiencing this issue on some of my machines.

The kernel thinks the illegal instruction is in the blosc library, which seems to be provided by numcodecs:

[7486855.845681] traps: python3[201485] trap invalid opcode ip:7f24c9a68b46 sp:7fffd0a84760 error:0
[7486855.845688]  in blosc.cpython-37m-x86_64-linux-gnu.so[7f24c9a64000+a8000]
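As a rough aid for locating the offending instruction, the two log lines above can be turned into an offset within the shared object. This sketch assumes the bracketed pair after the library name is the start address and size of the mapping that contains the instruction pointer; the helper name `fault_offset` is made up for illustration.

```python
import re

# ip:<hex> on the "traps:" line, and "in <name>[<base>+<size>]" on the next.
TRAP_RE = re.compile(r"ip:(?P<ip>[0-9a-f]+)")
MAP_RE = re.compile(
    r"in (?P<name>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(trap_line, map_line):
    """Return (library name, offset of the faulting ip within its mapping)."""
    ip = int(TRAP_RE.search(trap_line).group("ip"), 16)
    mapping = MAP_RE.search(map_line)
    base = int(mapping.group("base"), 16)
    size = int(mapping.group("size"), 16)
    if not base <= ip < base + size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return mapping.group("name"), ip - base


trap = ("[7486855.845681] traps: python3[201485] trap invalid opcode "
        "ip:7f24c9a68b46 sp:7fffd0a84760 error:0")
where = "[7486855.845688]  in blosc.cpython-37m-x86_64-linux-gnu.so[7f24c9a64000+a8000]"

name, offset = fault_offset(trap, where)
print(name, hex(offset))  # blosc.cpython-37m-x86_64-linux-gnu.so 0x4b46
```

The offset can then be looked up in a disassembly of the extension module to see which instruction is at fault. Note that it is relative to the start of the reported mapping, which is not guaranteed to coincide with the start of the file.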

zarr-developers / numcodecs

Detect AVX2 support at runtime #67