Open travisstaloch opened 3 years ago
If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.
The good news and bad news is that you are the pioneer of the first such builtin.
One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction. In this case it may emit a call to a compiler-rt function, which we can make sure is implemented for other architectures in addition to x86.
Regardless, I do think that if you start on this feature, it can be tackled one bit at a time and I'd be happy to help at any point along the way.
One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction.
Interesting. Not sure if this is what you meant, but I tried the following. I guess my clang-foo is failing. Any suggestions? Not sure 'i386' is a correct option for -mcpu. I tried several others like 'westmere', 'haswell' w/ same results.
$ clang-12 -c builtin-things.c -o foo.bc -emit-llvm --target=x86_64-linux -mcpu=i386 -mpclmul && llvm-dis-12 foo.bc -o foo.ll
clang: warning: argument unused during compilation: '-mcpu=i386' [-Wunused-command-line-argument]
$ cat builtin-things.c
#include <stdint.h>
#include <emmintrin.h>
#include <immintrin.h>
uint64_t prefix_xor(const uint64_t bitmask) {
__m128i all_ones = _mm_set1_epi8('\xFF');
__m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
return _mm_cvtsi128_si64(result);
}
$ clang-12 -c builtin-things.c -emit-llvm -S
builtin-things.c:7:20: error: '__builtin_ia32_pclmulqdq128' needs target feature pclmul
__m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
^
/home/andy/local/llvm12-release/lib/clang/12.0.1/include/__wmmintrin_pclmul.h:45:13: note: expanded from macro '_mm_clmulepi64_si128'
((__m128i)__builtin_ia32_pclmulqdq128((__v2di)(__m128i)(X), \
^
So now we know how clang handles this situation: compile error
For Zig, we should make the builtin always work, by providing an implementation if necessary.
uses different intrinsics per platform (and a custom implementation on arm
You can reuse parts of my open PR #9578 to select the correct function at comptime and expose the respective symbol. Note that I did not implement big endian support (yet), since there is no CI for testing due to LLVM MIPS regression.
Fortunately Rust has implemented usage of this very intrinsic resolving how this gets lowered: https://github.com/rust-lang/stdarch/issues/318.
Feature detection is in LLVM in lib/Support/Host.cpp
. Take note to use and reference the MIT release on porting, if possible. There is code in compiler_rt linking that release.
Probably it would also be good to have a central place for intrinsics.
LLVM has intrinsics defined in llvm/lib/IR
, but zig uses lib/std/special/compiler_rt
for compiler_rt
stuff.
So probably using lib/std/special/intrinsics
and according intrinsics.zig
should be fine to keep it separate, but indicate that things work similar to compiler_rt
. However that can also be decided during review.
I started working on implementing this. I'm currently able to generate the llvm intrinsic but having a name mangling issue that i'm not sure how to fix. I've posted the issue to llvm irc / discord.
The error is:
ld.lld: error: undefined symbol: llvm.x86.pclmulqdq.v2i64
The correct name is just llvm.x86.pclmulqdq
. I'm not sure how to get rid of the trailing .v2i64
. There must be some way to get irbuilder::CreateIntrinsic
to not add the type name. Or maybe there is an alternative to CreateIntrinsic
I should be using?
Of course the code is very hacky and messy so far. Just thought i would share my progress incase anyone has any thoughts on this mangling issue or any other thoughts about how to proceed.
looks like the error from my previous comment was solved in llvm-13.
riscv B extension has clmul too, simde has a cross platform c implementation we can learn from. And one of the use case of clmul is blazing fast clhash for zig cache
some other important bit manipulation instructions: PDEP and PEXT (and CLMUL, which can also be used for constructing bitty steps other than crypto algorithms). it could be used in wide variety of data co/decompressing, de/encoding algorithms.
polyfill: https://github.com/zwegner/zp7 (also see how AMD fails)
there are many use cases mentioned on the web:
"elegance": https://news.ycombinator.com/item?id=20205743
though elegance can't be quantized, in my humble opinion, they're like the new "CLZ CTZ POPCNT triad" as standard bit manipulation units. that's useful, non-trivial to polyfill and makes pain.
please consider also adding them to builtins..
I would like to work toward creating a carryless multiplication builtin in zig. This is a fast instruction used in simdjson for example to convert binary quote boundaries from json strings into masks. In the following example,
Q
is a 64 bit quote boundary marker. The last line is the result of a carryless multiplication betweenQ
and0xfffffffffffffff
In simdjson this is known as
prefix_xor
and is implemented here:Here are some references to this instruction in the zig repo:
The llvm x86 intrinsic is
llvm.x86.pclmulqdq
Name ideas:
@mulCarryless()
@mulWithoutCarry()
I hope to use this in my simdjson port to get rid off hacky llvm intrinsic calls such as the following which may not be possible in stage 2:
Related to #903
If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.