carryless multiplication builtin

travisstaloch commented 3 years ago

I would like to work toward creating a carryless multiplication builtin in zig. This is a fast instruction used in simdjson for example to convert binary quote boundaries from json strings into masks. In the following example, Q is a 64 bit quote boundary marker. The last line is the result of a carryless multiplication between Q and 0xfffffffffffffff

{ "\\\" Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" }: input data
__1___1_____1________1____1________1____1___________1_1_1___11__ : Q
______1_____________________________________________________1___ : OD
__1_________1________1____1________1____1___________1_1_1____1__ : Q &=~OD
__1111111111_________11111_________11111____________11__11111___ : CLMUL(Q,~0)

In simdjson this is known as prefix_xor and is implemented here:

Here are some references to this instruction in the zig repo:

The llvm x86 intrinsic is llvm.x86.pclmulqdq

Name ideas:

@mulCarryless()
@mulWithoutCarry()

I hope to use this in my simdjson port to get rid off hacky llvm intrinsic calls such as the following which may not be possible in stage 2:

@"llvm.x86.pclmulqdq"(@bitCast(i64x2, a), @bitCast(i64x2, b), 0)

Related to #903

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

andrewrk commented 3 years ago

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

The good news and bad news is that you are the pioneer of the first such builtin.

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction. In this case it may emit a call to a compiler-rt function, which we can make sure is implemented for other architectures in addition to x86.

Regardless, I do think that if you start on this feature, it can be tackled one bit at a time and I'd be happy to help at any point along the way.

travisstaloch commented 3 years ago

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction.

Interesting. Not sure if this is what you meant, but I tried the following. I guess my clang-foo is failing. Any suggestions? Not sure 'i386' is a correct option for -mcpu. I tried several others like 'westmere', 'haswell' w/ same results.

$ clang-12 -c builtin-things.c -o foo.bc -emit-llvm --target=x86_64-linux -mcpu=i386 -mpclmul &&  llvm-dis-12 foo.bc -o foo.ll

clang: warning: argument unused during compilation: '-mcpu=i386' [-Wunused-command-line-argument]

$ cat builtin-things.c
#include <stdint.h>
#include <emmintrin.h>
#include <immintrin.h>

uint64_t prefix_xor(const uint64_t bitmask) {
  __m128i all_ones = _mm_set1_epi8('\xFF');
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
  return _mm_cvtsi128_si64(result);
}

andrewrk commented 3 years ago

$ clang-12 -c builtin-things.c -emit-llvm -S
builtin-things.c:7:20: error: '__builtin_ia32_pclmulqdq128' needs target feature pclmul
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
                   ^
/home/andy/local/llvm12-release/lib/clang/12.0.1/include/__wmmintrin_pclmul.h:45:13: note: expanded from macro '_mm_clmulepi64_si128'
  ((__m128i)__builtin_ia32_pclmulqdq128((__v2di)(__m128i)(X), \
            ^

So now we know how clang handles this situation: compile error

For Zig, we should make the builtin always work, by providing an implementation if necessary.

matu3ba commented 3 years ago

uses different intrinsics per platform (and a custom implementation on arm

You can reuse parts of my open PR #9578 to select the correct function at comptime and expose the respective symbol. Note that I did not implement big endian support (yet), since there is no CI for testing due to LLVM MIPS regression.

Fortunately Rust has implemented usage of this very intrinsic resolving how this gets lowered: https://github.com/rust-lang/stdarch/issues/318.

Feature detection is in LLVM in lib/Support/Host.cpp. Take note to use and reference the MIT release on porting, if possible. There is code in compiler_rt linking that release.

Probably it would also be good to have a central place for intrinsics. LLVM has intrinsics defined in llvm/lib/IR, but zig uses lib/std/special/compiler_rt for compiler_rt stuff. So probably using lib/std/special/intrinsics and according intrinsics.zig should be fine to keep it separate, but indicate that things work similar to compiler_rt. However that can also be decided during review.

travisstaloch commented 3 years ago

I started working on implementing this. I'm currently able to generate the llvm intrinsic but having a name mangling issue that i'm not sure how to fix. I've posted the issue to llvm irc / discord.

The error is:

ld.lld: error: undefined symbol: llvm.x86.pclmulqdq.v2i64

The correct name is just llvm.x86.pclmulqdq. I'm not sure how to get rid of the trailing .v2i64. There must be some way to get irbuilder::CreateIntrinsic to not add the type name. Or maybe there is an alternative to CreateIntrinsic I should be using?

Of course the code is very hacky and messy so far. Just thought i would share my progress incase anyone has any thoughts on this mangling issue or any other thoughts about how to proceed.

travisstaloch commented 3 years ago

looks like the error from my previous comment was solved in llvm-13.

lin72h commented 2 years ago

riscv B extension has clmul too, simde has a cross platform c implementation we can learn from. And one of the use case of clmul is blazing fast clhash for zig cache

farteryhr commented 1 year ago

some other important bit manipulation instructions: PDEP and PEXT (and CLMUL, which can also be used for constructing bitty steps other than crypto algorithms). it could be used in wide variety of data co/decompressing, de/encoding algorithms.

polyfill: https://github.com/zwegner/zp7 (also see how AMD fails)

there are many use cases mentioned on the web:

https://github.com/Forceflow/libmorton/issues/6
https://www.zhihu.com/question/27824125/answer/2621765689 (chinese, database engine developer)
https://haskell-works.github.io/posts/2018-08-22-pdep-and-pext-bit-manipulation-functions.html

"elegance": https://news.ycombinator.com/item?id=20205743

though elegance can't be quantized, in my humble opinion, they're like the new "CLZ CTZ POPCNT triad" as standard bit manipulation units. that's useful, non-trivial to polyfill and makes pain.

please consider also adding them to builtins..

ziglang / zig

carryless multiplication builtin #9631