Missing compiler_rt functions

skyfex commented 6 years ago

I was playing around with the latest version of Zig (https://ci.appveyor.com/project/andrewrk/zig-d3l86/build/0.2.0+95f45cfc) on ARM Cortex-M0 and I had a lot of issues with missing compiler_rt functions.

It was missing '__aeabi_memcpy'

It was complaining about missing '__aeabi_uldivmod', which I fixed by commenting out if (isArmArch()) { in compiler_rt/index.zig. Later I found out that it also helps to use --target-arch armv6 instead of thumb, or to add builtin.Arch.thumb to isArmArch (so that should probably be fixed).

After that it complained about __aeabi_h2f, __aeabi_f2h and __multi3

This is the command I used

zig build-exe --static --target-os freestanding --target-arch thumb --target-environ eabihf  --libc-include-dir include --linker-script system\nrf51_xxaa.ld --verbose-link -isystem include --libc-include-dir include --library-path system --assembly system\gcc_startup_nrf51.S --object zig-cache/system_nrf51.o test.zig

Using fmt.bufPrint is what is triggering these errors for me right now.

Is it possible to get a more permanent/robust solution to compiler_rt related problems? Some kind of automated testing that everything is there somehow?

(Btw, good news is that linking with Zig (rather than GCC or LLD) seems to work just fine for me now)

skyfex commented 6 years ago

I'm also often getting these warnings, don't know what they mean: lld: warning: lld may use movt/movw, no object with architecture supporting feature detected.

andrewrk commented 6 years ago

Is it possible to get a more permanent/robust solution to compiler_rt related problems? Some kind of automated testing that everything is there somehow?

So far, I haven't tried to make sure we have all the compiler_rt functions. I've just been adding the ones that were missing when I personally tried something and got these errors.

So the first step to getting a permanent/robust solution to compiler_rt related problems is to go ahead and port all the rest of compiler_rt. I thought we had an issue open for that but I'm unable to find it, so this will be that issue.

andrewrk commented 5 years ago

Thanks to @winksaville we now have these functions:

__fixdfdi, __fixdfsi, __fixdfti.
__fixsfdi, __fixsfsi, __fixsfti.
__fixtfdi, __fixtfsi, __fixtfti.

The next step toward solving this issue is to compile a checklist of all the functions from LLVM's compiler-rt builtins/ directory and then start porting them one by one.

tiehuis commented 5 years ago

This list is derived from the following README file.

https://raw.githubusercontent.com/llvm-mirror/compiler-rt/master/lib/builtins/README.txt

There are some platform-specific functions that need to be added here, for example under https://github.com/llvm-mirror/compiler-rt/tree/master/lib/builtins/arm.

Integral bit manipulation

[ ] di_int __ashldi3(di_int a, si_int b); // a << b
[ ] ti_int __ashlti3(ti_int a, si_int b); // a << b
[ ] di_int __ashrdi3(di_int a, si_int b); // a >> b arithmetic (sign fill)
[ ] ti_int __ashrti3(ti_int a, si_int b); // a >> b arithmetic (sign fill)
[ ] di_int __lshrdi3(di_int a, si_int b); // a >> b logical (zero fill)
[ ] ti_int __lshrti3(ti_int a, si_int b); // a >> b logical (zero fill)
[ ] si_int __clzsi2(si_int a); // count leading zeros
[ ] si_int __clzdi2(di_int a); // count leading zeros
[ ] si_int __clzti2(ti_int a); // count leading zeros
[ ] si_int __ctzsi2(si_int a); // count trailing zeros
[ ] si_int __ctzdi2(di_int a); // count trailing zeros
[ ] si_int __ctzti2(ti_int a); // count trailing zeros
[ ] si_int __ffssi2(si_int a); // find least significant 1 bit
[ ] si_int __ffsdi2(di_int a); // find least significant 1 bit
[ ] si_int __ffsti2(ti_int a); // find least significant 1 bit
[ ] si_int __paritysi2(si_int a); // bit parity
[ ] si_int __paritydi2(di_int a); // bit parity
[ ] si_int __parityti2(ti_int a); // bit parity
[ ] si_int __popcountsi2(si_int a); // bit population
[ ] si_int __popcountdi2(di_int a); // bit population
[ ] si_int __popcountti2(ti_int a); // bit population
[ ] uint32_t __bswapsi2(uint32_t a); // a byteswapped
[ ] uint64_t __bswapdi2(uint64_t a); // a byteswapped
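To make the one-line comments above concrete, here is a minimal C sketch of two of the listed operations, bit parity and byte swap, for the 32-bit case. The names sketch_paritysi2 and sketch_bswapsi2 are hypothetical stand-ins chosen for illustration; this is not the compiler-rt or Zig source, just one straightforward way to get the documented results.

```c
#include <stdint.h>

/* Hypothetical stand-in for __paritysi2: returns 1 if the number of
 * set bits in a is odd, 0 if it is even. */
int32_t sketch_paritysi2(uint32_t a) {
    /* Fold the word onto itself with XOR so that bit 0 ends up holding
     * the XOR of all 32 original bits. */
    a ^= a >> 16;
    a ^= a >> 8;
    a ^= a >> 4;
    a ^= a >> 2;
    a ^= a >> 1;
    return (int32_t)(a & 1u);
}

/* Hypothetical stand-in for __bswapsi2: reverses the byte order. */
uint32_t sketch_bswapsi2(uint32_t a) {
    return (a >> 24)
         | ((a >> 8) & 0x0000FF00u)
         | ((a << 8) & 0x00FF0000u)
         | (a << 24);
}
```

The 64-bit and 128-bit variants can follow the same pattern with one more folding step or by combining two narrower halves.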

Integral arithmetic

[ ] di_int __negdi2 (di_int a); // -a
[ ] ti_int __negti2 (ti_int a); // -a
[ ] di_int __muldi3 (di_int a, di_int b); // a * b
[x] ti_int __multi3 (ti_int a, ti_int b); // a * b
[ ] si_int __divsi3 (si_int a, si_int b); // a / b signed
[ ] di_int __divdi3 (di_int a, di_int b); // a / b signed
[x] ti_int __divti3 (ti_int a, ti_int b); // a / b signed
[x] su_int __udivsi3 (su_int n, su_int d); // a / b unsigned
[x] du_int __udivdi3 (du_int a, du_int b); // a / b unsigned
[x] tu_int __udivti3 (tu_int a, tu_int b); // a / b unsigned
[ ] si_int __modsi3 (si_int a, si_int b); // a % b signed
[ ] di_int __moddi3 (di_int a, di_int b); // a % b signed
[x] ti_int __modti3 (ti_int a, ti_int b); // a % b signed
[ ] su_int __umodsi3 (su_int a, su_int b); // a % b unsigned
[x] du_int __umoddi3 (du_int a, du_int b); // a % b unsigned
[x] tu_int __umodti3 (tu_int a, tu_int b); // a % b unsigned
[x] du_int __udivmoddi4(du_int a, du_int b, du_int* rem); // a / b, *rem = a % b unsigned
[x] tu_int __udivmodti4(tu_int a, tu_int b, tu_int* rem); // a / b, *rem = a % b unsigned
[x] su_int __udivmodsi4(su_int a, su_int b, su_int* rem); // a / b, *rem = a % b unsigned
[ ] si_int __divmodsi4(si_int a, si_int b, si_int* rem); // a / b, *rem = a % b signed
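The divmod entries return the quotient and write the remainder through a pointer. A minimal C sketch of that contract for the 64-bit unsigned case, using plain shift-and-subtract division, might look like the following. The names sketch_udivmoddi4 and sketch_umoddi3 are hypothetical, division by zero is deliberately left unhandled, and real implementations are considerably more optimized.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __udivmoddi4: returns a / b and, when rem
 * is not NULL, stores a % b in *rem. Behavior for b == 0 is not defined
 * by this sketch. */
uint64_t sketch_udivmoddi4(uint64_t a, uint64_t b, uint64_t *rem) {
    uint64_t q = 0;
    uint64_t r = 0;
    for (int i = 63; i >= 0; i--) {
        /* Remember the bit that falls off the top when r is doubled. */
        uint64_t carry = r >> 63;
        r = (r << 1) | ((a >> i) & 1u);
        /* With the carry included, the running remainder is >= b. */
        if (carry || r >= b) {
            r -= b;
            q |= (uint64_t)1 << i;
        }
    }
    if (rem != NULL) {
        *rem = r;
    }
    return q;
}

/* The plain modulo routine can then be a thin wrapper. */
uint64_t sketch_umoddi3(uint64_t a, uint64_t b) {
    uint64_t r;
    sketch_udivmoddi4(a, b, &r);
    return r;
}
```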

Integral arithmetic with trapping overflow

[ ] si_int __absvsi2(si_int a); // abs(a)
[ ] di_int __absvdi2(di_int a); // abs(a)
[ ] ti_int __absvti2(ti_int a); // abs(a)
[ ] si_int __negvsi2(si_int a); // -a
[ ] di_int __negvdi2(di_int a); // -a
[ ] ti_int __negvti2(ti_int a); // -a
[ ] si_int __addvsi3(si_int a, si_int b); // a + b
[ ] di_int __addvdi3(di_int a, di_int b); // a + b
[ ] ti_int __addvti3(ti_int a, ti_int b); // a + b
[ ] si_int __subvsi3(si_int a, si_int b); // a - b
[ ] di_int __subvdi3(di_int a, di_int b); // a - b
[ ] ti_int __subvti3(ti_int a, ti_int b); // a - b
[ ] si_int __mulvsi3(si_int a, si_int b); // a * b
[ ] di_int __mulvdi3(di_int a, di_int b); // a * b
[ ] ti_int __mulvti3(ti_int a, ti_int b); // a * b
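"Trapping" here means the routine aborts the program instead of returning a wrapped result. A minimal C sketch of the idea for 32-bit addition follows; sketch_addvsi3 is a hypothetical name, and calling abort() is an assumption about what "trap" should mean in this illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for __addvsi3: returns a + b, or aborts if the
 * mathematical result does not fit in a signed 32-bit integer. */
int32_t sketch_addvsi3(int32_t a, int32_t b) {
    /* Check before adding, because signed overflow is undefined
     * behavior in C. */
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        abort();
    }
    return a + b;
}
```

Subtraction and multiplication variants would need their own range checks but follow the same check-then-compute shape.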

Integral arithmetic which returns if overflow

[ ] si_int __mulosi4(si_int a, si_int b, int* overflow); // a * b, overflow set to one if result not in signed range
[ ] di_int __mulodi4(di_int a, di_int b, int* overflow); // a * b, overflow set to one if result not in signed range
[x] ti_int __muloti4(ti_int a, ti_int b, int* overflow); // a * b, overflow set to one if result not in signed range

Integral comparison:

a  < b -> 0
a == b -> 1
a  > b -> 2

[ ] si_int __cmpdi2 (di_int a, di_int b);
[ ] si_int __cmpti2 (ti_int a, ti_int b);
[ ] si_int __ucmpdi2(du_int a, du_int b);
[ ] si_int __ucmpti2(tu_int a, tu_int b);
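The 0/1/2 return convention above is easy to get wrong, so here is a minimal C sketch of the 64-bit signed and unsigned comparisons. The sketch_ names are hypothetical; only the return convention is taken from the list above.

```c
#include <stdint.h>

/* Hypothetical stand-in for __cmpdi2: 0 if a < b, 1 if a == b, 2 if a > b. */
int32_t sketch_cmpdi2(int64_t a, int64_t b) {
    /* (a > b) and (a < b) each evaluate to 0 or 1, so the sum is 0, 1 or 2. */
    return (int32_t)(a > b) - (int32_t)(a < b) + 1;
}

/* Hypothetical stand-in for __ucmpdi2: same convention, unsigned operands. */
int32_t sketch_ucmpdi2(uint64_t a, uint64_t b) {
    return (int32_t)(a > b) - (int32_t)(a < b) + 1;
}
```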

Integral / floating point conversion

[x] di_int __fixsfdi( float a);
[x] di_int __fixdfdi( double a);
[ ] di_int __fixxfdi(long double a);
[x] ti_int __fixsfti( float a);
[x] ti_int __fixdfti( double a);
[ ] ti_int __fixxfti(long double a);
[x] uint64_t __fixtfdi(long double input); // ppc only, doesn't match documentation
[x] su_int __fixunssfsi( float a);
[x] su_int __fixunsdfsi( double a);
[ ] su_int __fixunsxfsi(long double a);
[x] du_int __fixunssfdi( float a);
[x] du_int __fixunsdfdi( double a);
[ ] du_int __fixunsxfdi(long double a);
[x] tu_int __fixunssfti( float a);
[x] tu_int __fixunsdfti( double a);
[ ] tu_int __fixunsxfti(long double a);
[x] uint64_t __fixunstfdi(long double input); // ppc only
[ ] float __floatdisf(di_int a);
[ ] double __floatdidf(di_int a);
[ ] long double __floatdixf(di_int a);
[ ] long double __floatditf(int64_t a); // ppc only
[x] float __floattisf(ti_int a);
[x] double __floattidf(ti_int a);
[ ] long double __floattixf(ti_int a);
[ ] float __floatundisf(du_int a);
[ ] double __floatundidf(du_int a);
[ ] long double __floatundixf(du_int a);
[x] long double __floatunditf(uint64_t a); // ppc only
[x] float __floatuntisf(tu_int a);
[x] double __floatuntidf(tu_int a);
[ ] long double __floatuntixf(tu_int a);
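On targets without hardware floating point, these conversions have to be done by picking the IEEE 754 encoding apart. A minimal C sketch of a float to unsigned 32-bit conversion is shown below. The name sketch_fixunssfsi is hypothetical, and returning 0 for negative inputs and saturating for too-large inputs are choices made for this sketch, not a statement about any particular implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for __fixunssfsi: truncates a float toward zero
 * and returns it as an unsigned 32-bit integer. */
uint32_t sketch_fixunssfsi(float a) {
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);               /* reinterpret the float */

    uint32_t sign = bits >> 31;
    int exp = (int)((bits >> 23) & 0xFFu) - 127;  /* unbiased exponent */
    uint32_t mant = (bits & 0x007FFFFFu) | 0x00800000u; /* implicit 1 */

    if (sign || exp < 0) {
        return 0;           /* negative, or magnitude below 1 */
    }
    if (exp >= 32) {
        return UINT32_MAX;  /* too large (also covers inf/NaN): saturate */
    }
    /* The mantissa represents 1.xxx scaled by 2^23, so shift it into place. */
    if (exp < 23) {
        return mant >> (23 - exp);
    }
    return mant << (exp - 23);
}
```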

Floating point raised to integer power

[ ] float __powisf2( float a, si_int b); // a ^ b
[ ] double __powidf2( double a, si_int b); // a ^ b
[ ] long double __powixf2(long double a, si_int b); // a ^ b
[ ] long double __powitf2(long double a, si_int b); // ppc only, a ^ b
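Raising a float to an integer power is usually done by repeated squaring. A minimal C sketch of that technique follows; sketch_powisf2 is a hypothetical name and the code is an illustration of the technique, not a copy of any shipped routine.

```c
#include <stdint.h>

/* Hypothetical stand-in for __powisf2: computes a raised to the integer
 * power b by repeated squaring. Negative exponents give the reciprocal. */
float sketch_powisf2(float a, int32_t b) {
    const int recip = b < 0;
    float r = 1.0f;
    for (;;) {
        if (b & 1) {
            r *= a;       /* this bit of the exponent contributes a factor */
        }
        b /= 2;           /* truncates toward zero, so it works for b < 0 */
        if (b == 0) {
            break;
        }
        a *= a;           /* next power of two of the base */
    }
    return recip ? 1.0f / r : r;
}
```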

Complex arithmetic

Following are not required since we do not have language-level complex number support.

`(a + ib) * (c + id)`

[ ] ~~float _Complex __mulsc3( float a, float b, float c, float d);~~
[ ] ~~double _Complex __muldc3(double a, double b, double c, double d);~~
[ ] ~~long double _Complex __mulxc3(long double a, long double b, long double c, long double d);~~
[ ] ~~long double _Complex __multc3(long double a, long double b, long double c, long double d); // ppc only~~

`(a + ib) / (c + id)`

[ ] ~~float _Complex __divsc3( float a, float b, float c, float d);~~
[ ] ~~double _Complex __divdc3(double a, double b, double c, double d);~~
[ ] ~~long double _Complex __divxc3(long double a, long double b, long double c, long double d);~~
[ ] ~~long double _Complex __divtc3(long double a, long double b, long double c, long double d); // ppc only~~

Omitted Runtime support functions

Power PC specific functions

adds two 128-bit double-double precision values ( x + y )

[ ] long double __gcc_qadd(long double x, long double y);

subtracts two 128-bit double-double precision values ( x - y )

[ ] long double __gcc_qsub(long double x, long double y);

multiplies two 128-bit double-double precision values ( x * y )

[ ] long double __gcc_qmul(long double x, long double y);

divides two 128-bit double-double precision values ( x / y )

[ ] long double __gcc_qdiv(long double a, long double b);
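A "double-double" value is a pair of doubles whose sum represents one higher-precision number. As a rough illustration of what an addition on such pairs involves, here is a minimal C sketch built on the classic error-free two-sum step. The dd_t struct and sketch_dd_add are hypothetical names introduced for this example; this is a simplified textbook variant and makes no claim to match the __gcc_qadd algorithm or its handling of special values.

```c
/* Hypothetical representation of a double-double value: value = hi + lo. */
typedef struct {
    double hi;
    double lo;
} dd_t;

/* Simplified double-double addition (assumes no fast-math style
 * reassociation by the compiler). */
dd_t sketch_dd_add(dd_t x, dd_t y) {
    /* Two-sum: s is the rounded sum of the high parts and e is the
     * exact rounding error of that addition. */
    double s = x.hi + y.hi;
    double bb = s - x.hi;
    double e = (x.hi - (s - bb)) + (y.hi - bb);

    /* Fold in the low parts, then renormalize into (hi, lo). */
    e += x.lo + y.lo;

    dd_t r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}
```

The point of the pair is visible when adding 1.0 and 1e-20: a plain double sum loses the small term, while the sketch keeps it in the low part.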

ARM specific functions

Undocumented functions

[ ] float __addsf3vfp(float a, float b); // Appears to return a + b
[ ] double __adddf3vfp(double a, double b); // Appears to return a + b
[ ] float __divsf3vfp(float a, float b); // Appears to return a / b
[ ] double __divdf3vfp(double a, double b); // Appears to return a / b
[ ] int __eqsf2vfp(float a, float b); // Appears to return one iff a == b and neither is NaN.
[ ] int __eqdf2vfp(double a, double b); // Appears to return one iff a == b and neither is NaN.
[ ] double __extendsfdf2vfp(float a); // Appears to convert from float to double.
[ ] int __fixdfsivfp(double a); // Appears to convert from double to int.
[ ] int __fixsfsivfp(float a); // Appears to convert from float to int.
[ ] unsigned int __fixunssfsivfp(float a); // Appears to convert from float to unsigned int.
[ ] unsigned int __fixunsdfsivfp(double a); // Appears to convert from double to unsigned int.
[ ] double __floatsidfvfp(int a); // Appears to convert from int to double.
[ ] float __floatsisfvfp(int a); // Appears to convert from int to float.
[ ] double __floatunssidfvfp(unsigned int a); // Appears to convert from unsigned int to double.
[ ] float __floatunssisfvfp(unsigned int a); // Appears to convert from unsigned int to float.
[ ] int __gedf2vfp(double a, double b); // Appears to return __gedf2 (a >= b)
[ ] int __gesf2vfp(float a, float b); // Appears to return __gesf2 (a >= b)
[ ] int __gtdf2vfp(double a, double b); // Appears to return __gtdf2 (a > b)
[ ] int __gtsf2vfp(float a, float b); // Appears to return __gtsf2 (a > b)
[ ] int __ledf2vfp(double a, double b); // Appears to return __ledf2 (a <= b)
[ ] int __lesf2vfp(float a, float b); // Appears to return __lesf2 (a <= b)
[ ] int __ltdf2vfp(double a, double b); // Appears to return __ltdf2 (a < b)
[ ] int __ltsf2vfp(float a, float b); // Appears to return __ltsf2 (a < b)
[ ] double __muldf3vfp(double a, double b); // Appears to return a * b
[ ] float __mulsf3vfp(float a, float b); // Appears to return a * b
[ ] int __nedf2vfp(double a, double b); // Appears to return __nedf2 (a != b)
[ ] double __negdf2vfp(double a); // Appears to return -a
[ ] float __negsf2vfp(float a); // Appears to return -a
[ ] double __subdf3vfp(double a, double b); // Appears to return a - b
[ ] float __subsf3vfp(float a, float b); // Appears to return a - b
[ ] float __truncdfsf2vfp(double a); // Appears to convert from double to float.
[ ] int __unorddf2vfp(double a, double b); // Appears to return __unorddf2
[ ] int __unordsf2vfp(float a, float b); // Appears to return __unordsf2

radek-senfeld commented 5 years ago

After that it complained about __aeabi_h2f, __aeabi_f2h and __multi3

I had the same problem. It turns out you need to compile for gnueabi target in order to link using GCC ld.

EABI names: __aeabi_f2h, __aeabi_h2f
GNUEABI names: __gnu_f2h_ieee, __gnu_h2f_ieee

This works for me:

zig build-obj -target armv7m-freestanding-gnueabi --output-dir build/ /opt/zig/lib/zig/std/special/compiler_rt.zig

zig build-obj -target armv7m-freestanding-gnueabi --output-dir build/ src/main.zig

arm-none-eabi-g++ -o build/bluepill.elf --static -nostartfiles -Wl,--gc-sections -T ld/stm32f103c8.ld --specs=nosys.specs lib/libopencm3/lib/libopencm3_stm32f1.a build/compiler_rt.o build/debug.o build/stm32.o build/stm32f1.o build/main.o -Llib/libopencm3/lib -lc -lm -lgcc -lnosys -lopencm3_stm32f1

andrewrk commented 5 years ago

@radek-senfeld we can add __aeabi_f2h, __aeabi_h2f, and __multi3 and then you should be able to zig build-exe -target armv7m-freestanding-eabi directly, no dependency on a cross compiled g++.

It's pretty easy to add compiler-rt functions. I can try to get those in today. Were there any other missing ones for you besides these 3?

radek-senfeld commented 5 years ago

Hi Andrew, at first I want to thank you for your amazing project! I'm hooked to Zig!

Well, my motivation was that I just had to know what is the problem and why it doesn't work. After spending quite a few hours digging around I now much better understand how things work under the hood.

Well, there's apparently no issue with EABIs when Zig links the file. Running zig build-exe -target armv7m-freestanding-eabi src/main.zig doesn't emit any errors.

Which is quite strange because of the omitted opencm3_stm32f1 library. It should babble about missing symbols, shouldn't it?

Were there any other missing ones for you besides these 3?

After linking with compiler_rt.o everything I've tried has been sorted out. I just wanted to know if there's chance to have fp formatting in the firmware. Unfortunately the size of this feature is prohibitive (~100kB) at this moment. I've tried just a simple test:

const value: f32 = 3.1415;
debug.message("test! value: {}", value);

There's one more issue but I'm not quite sure about the cause yet. It just hangs the MCU. I need to inspect it using a debugger. This causes MCU to hang:

const value: u32 = 5;
debug.message("test! value: {}", value);

Environment:

rush@jarvis:~$ zig version
0.3.0+6acabd6b

andrewrk commented 5 years ago

Hi Andrew, at first I want to thank you for your amazing project! I'm hooked to Zig!

Thank you for the compliment and I'm happy that you like it.

It should babble about missing symbols, shouldn't it?

Only if the symbols are called or used. Perhaps this is zig's lazy analysis of top level declarations? If you do not call a function then it does not get analyzed or included in the result.

andrewrk commented 5 years ago

I just wanted to know if there's chance to have fp formatting in the firmware. Unfortunately the size of this feature is prohibitive (~100kB) at this moment.

Ah that's interesting. Here we have a good use case for perhaps selecting a different implementation of floating point formatting when the --release-small mode is selected. This is a related issue: #1299. @tiehuis has a work-in-progress implementation of this and probably knows how much smaller of a payload it would be than errol3. It's also possible that we include another floating point printing algorithm that is optimized for code size rather than performance.

Feel free to open a new issue which is dedicated to exploring your exact use case. We can comment back and forth there and perhaps learn some new Zig issues that need to be filed.

radek-senfeld commented 5 years ago

Only if the symbols are called or used. Perhaps this is zig's lazy analysis of top level declarations? If you do not call a function then it does not get analyzed or included in the result.

You're probably right.

I guess it's caused by me not specifying a linker script. Which means entry point isn't defined thus fn main() is not linked in and no symbols are missing because they are not used.

radek-senfeld commented 5 years ago

Ah that's interesting. Here we have a good use case for perhaps selecting a different implementation of floating point formatting when the --release-small mode is selected. This is a related issue: #1299. @tiehuis has a work-in-progress implementation of this and probably knows how much smaller of a payload it would be than errol3. It's also possible that we include another floating point printing algorithm that is optimized for code size rather than performance.

Oh, my bad. Now I feel a bit stupid. When compiled using --release-small the resulting binary is actually < 9kB!

Floating-point formatting enabled:

const value: f32 = 3.1415;
debug.message("test! value: {}", value);

Here is the summary:

no release specified (debug, right?): 108016B
--release-fast: 8524B
--release-safe: 13068B
--release-small: 8532B

andrewrk commented 5 years ago

The above commit adds:

__mulsf3
__muldf3
__multf3

magv commented 5 years ago

Hi, everyone. Can we also get __ashlti3 and __lshrti3?

These two prevent the float formatting via fmt.bufPrint from being used under the wasm32-freestanding target. They show up as extra imports:

  (type $t0 (func (param i32 i64 i64 i32)))
  (import "env" "__ashlti3" (func $__ashlti3 (type $t0)))
  (import "env" "__lshrti3" (func $__lshrti3 (type $t0)))

Note that because the imports have i64 in the arguments, it's not even possible to provide them from the JavaScript side. In fact, the browsers refuse to compile the wasm files zig is producing here.
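For readers unfamiliar with these helpers: __ashlti3 is a left shift of a 128-bit integer, which on targets without native 128-bit registers is built from 64-bit halves (presumably the two i64 parameters in the import type above). A minimal C sketch of that composition follows. The u128_parts struct and sketch_ashlti3 are hypothetical names, and the sketch assumes a shift amount in the range 0 to 127.

```c
#include <stdint.h>

/* Hypothetical 128-bit value split into two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Hypothetical stand-in for __ashlti3: a << b for 0 <= b < 128. */
u128_parts sketch_ashlti3(u128_parts a, int32_t b) {
    u128_parts r;
    if (b == 0) {
        return a;  /* avoids an invalid 64-bit shift by 64 below */
    }
    if (b >= 64) {
        /* Everything from the low half moves into the high half. */
        r.lo = 0;
        r.hi = a.lo << (b - 64);
    } else {
        /* The bits shifted out of the low half carry into the high half. */
        r.lo = a.lo << b;
        r.hi = (a.hi << b) | (a.lo >> (64 - b));
    }
    return r;
}
```

__lshrti3 is the mirror image: bits move from the high half down into the low half.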

andrewrk commented 5 years ago

Thanks to @LemonBoy, __ashlti3 and __lshrti3 are now available for all targets, including wasm32.

magv commented 5 years ago

Thanks, @LemonBoy and Andrew; fmt.bufPrint works under wasm now.

matu3ba commented 3 years ago

Unfortunately I can not tick the boxes from @tiehuis , so I will add it here.

NOTE: libgcc changed the definition and LLVM did not update it in compiler_rt yet.

Integral bit manipulation

[x] di_int __ashldi3(di_int a, si_int b); // a << b
[x] ti_int __ashlti3(ti_int a, si_int b); // a << b
[x] di_int __ashrdi3(di_int a, si_int b); // a >> b arithmetic (sign fill)
[x] ti_int __ashrti3(ti_int a, si_int b); // a >> b arithmetic (sign fill)
[x] di_int __lshrdi3(di_int a, si_int b); // a >> b logical (zero fill)
[x] ti_int __lshrti3(ti_int a, si_int b); // a >> b logical (zero fill)
[x] si_int __clzsi2(si_int a); // count leading zeros
[x] si_int __clzdi2(di_int a); // count leading zeros
[x] si_int __clzti2(ti_int a); // count leading zeros
[x] si_int __ctzsi2(si_int a); // count trailing zeros
[x] si_int __ctzdi2(di_int a); // count trailing zeros
[x] si_int __ctzti2(ti_int a); // count trailing zeros
[x] si_int __ffssi2(si_int a); // find least significant 1 bit => __ctzsi2(a) + 1, except 0 for a=0
[x] si_int __ffsdi2(di_int a); // find least significant 1 bit => __ctzdi2(a) + 1, except 0 for a=0
[x] si_int __ffsti2(ti_int a); // find least significant 1 bit => __ctzti2(a) + 1, except 0 for a=0
[x] si_int __paritysi2(si_int a); // bit parity
[x] si_int __paritydi2(di_int a); // bit parity
[x] si_int __parityti2(ti_int a); // bit parity
[x] si_int __popcountsi2(si_int a); // bit population
[x] si_int __popcountdi2(di_int a); // bit population
[x] si_int __popcountti2(ti_int a); // bit population
[x] uint32_t __bswapsi2(uint32_t a); // a byteswapped
[x] uint64_t __bswapdi2(uint64_t a); // a byteswapped
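The relationship between the ffs and ctz routines is worth spelling out, since the libgcc documentation defines ffs with a 1-based bit index and a result of 0 for a zero input. A minimal C sketch for the 32-bit case follows; the sketch_ names are hypothetical and the loop is chosen for clarity rather than speed.

```c
#include <stdint.h>

/* Hypothetical stand-in for __ctzsi2: number of trailing zero bits.
 * The input must be nonzero. */
int32_t sketch_ctzsi2(uint32_t a) {
    int32_t n = 0;
    while ((a & 1u) == 0) {
        a >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical stand-in for __ffssi2: 1-based index of the least
 * significant set bit, or 0 if no bit is set. */
int32_t sketch_ffssi2(uint32_t a) {
    return a == 0 ? 0 : sketch_ctzsi2(a) + 1;
}
```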

Integral arithmetic

NOTE: ti ones look architecture specific, so unsure if complete.

[x] di_int __negdi2 (di_int a); // -a
[x] ti_int __negti2 (ti_int a); // -a
[x] di_int __muldi3 (di_int a, di_int b); // a * b
[x] ti_int __multi3 (ti_int a, ti_int b); // a * b
[x] si_int __divsi3 (si_int a, si_int b); // a / b signed
[x] di_int __divdi3 (di_int a, di_int b); // a / b signed
[x] ti_int __divti3 (ti_int a, ti_int b); // a / b signed
[x] su_int __udivsi3 (su_int n, su_int d); // a / b unsigned
[x] du_int __udivdi3 (du_int a, du_int b); // a / b unsigned
[x] tu_int __udivti3 (tu_int a, tu_int b); // a / b unsigned
[x] si_int __modsi3 (si_int a, si_int b); // a % b signed
[x] di_int __moddi3 (di_int a, di_int b); // a % b signed
[x] ti_int __modti3 (ti_int a, ti_int b); // a % b signed
[x] su_int __umodsi3 (su_int a, su_int b); // a % b unsigned
[x] du_int __umoddi3 (du_int a, du_int b); // a % b unsigned
[x] tu_int __umodti3 (tu_int a, tu_int b); // a % b unsigned
[x] du_int __udivmoddi4(du_int a, du_int b, du_int* rem); // a / b, *rem = a % b unsigned (TODO sort)
[x] tu_int __udivmodti4(tu_int a, tu_int b, tu_int* rem); // a / b, *rem = a % b unsigned
[x] su_int __udivmodsi4(su_int a, su_int b, su_int* rem); // a / b, *rem = a % b unsigned
[x] si_int __divmodsi4(si_int a, si_int b, si_int* rem); // a / b, *rem = a % b signed

Integral arithmetic with trapping overflow

[x] si_int __absvsi2(si_int a); // abs(a)
[x] di_int __absvdi2(di_int a); // abs(a)
[x] ti_int __absvti2(ti_int a); // abs(a)
[x] si_int __negvsi2(si_int a); // -a
[x] di_int __negvdi2(di_int a); // -a
[x] ti_int __negvti2(ti_int a); // -a
[ ] si_int __addvsi3(si_int a, si_int b); // a + b (blocked by testing panics not working)
[ ] di_int __addvdi3(di_int a, di_int b); // a + b
[ ] ti_int __addvti3(ti_int a, ti_int b); // a + b
[ ] si_int __subvsi3(si_int a, si_int b); // a - b
[ ] di_int __subvdi3(di_int a, di_int b); // a - b
[ ] ti_int __subvti3(ti_int a, ti_int b); // a - b
[ ] si_int __mulvsi3(si_int a, si_int b); // a * b
[ ] di_int __mulvdi3(di_int a, di_int b); // a * b
[ ] ti_int __mulvti3(ti_int a, ti_int b); // a * b

Integral arithmetic which returns if overflow

[x] si_int __mulosi4(si_int a, si_int b, int* overflow); // a * b, overflow set to one if result not in signed range. 16-17x perf vs llvm!
[x] di_int __mulodi4(di_int a, di_int b, int* overflow); // a * b, overflow set to one if result not in signed range
[x] ti_int __muloti4(ti_int a, ti_int b, int* overflow); // a * b, overflow set to one if result not in signed range
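For the narrowest variant, the overflow flag can be computed without any division by doing the multiplication in a wider type. The following minimal C sketch shows that idea; sketch_mulosi4 and sketch_mul_overflows are hypothetical names, and this is not the implementation whose performance is quoted above.

```c
#include <stdint.h>

/* Hypothetical stand-in for __mulosi4: returns the truncated product and
 * sets *overflow to 1 if the exact product does not fit in 32 signed
 * bits, otherwise to 0. */
int32_t sketch_mulosi4(int32_t a, int32_t b, int *overflow) {
    /* The exact product of two 32-bit values always fits in 64 bits. */
    int64_t wide = (int64_t)a * (int64_t)b;
    *overflow = (wide < INT32_MIN || wide > INT32_MAX) ? 1 : 0;
    /* Keep the low 32 bits. The final conversion to int32_t is
     * implementation-defined for out-of-range values and wraps on
     * common compilers. */
    return (int32_t)(uint32_t)wide;
}

/* Convenience wrapper that only reports the flag. */
int sketch_mul_overflows(int32_t a, int32_t b) {
    int overflow;
    (void)sketch_mulosi4(a, b, &overflow);
    return overflow;
}
```

The widening trick stops working at the widest supported integer type, which is where more involved approaches come in.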

Integral comparison:

a  < b -> 0
a == b -> 1
a  > b -> 2

[x] si_int __cmpdi2 (di_int a, di_int b);
[x] si_int __cmpti2 (ti_int a, ti_int b);
[x] si_int __ucmpdi2(du_int a, du_int b);
[x] si_int __ucmpti2(tu_int a, tu_int b);

Integral / floating point conversion

TODO sort these

[x] di_int __fixsfdi( float a);
[x] di_int __fixdfdi( double a);
[ ] di_int __fixxfdi(long double a);
[x] ti_int __fixsfti( float a);
[x] ti_int __fixdfti( double a);
[ ] ti_int __fixxfti(long double a);
[x] uint64_t __fixtfdi(long double input); // ppc only, doesn't match documentation
[x] su_int __fixunssfsi( float a);
[x] su_int __fixunsdfsi( double a);
[ ] su_int __fixunsxfsi(long double a);
[x] du_int __fixunssfdi( float a);
[x] du_int __fixunsdfdi( double a);
[ ] du_int __fixunsxfdi(long double a);
[x] tu_int __fixunssfti( float a);
[x] tu_int __fixunsdfti( double a);
[ ] tu_int __fixunsxfti(long double a);
[x] uint64_t __fixunstfdi(long double input); // ppc only
[x] float __floatdisf(di_int a);
[x] double __floatdidf(di_int a);
[ ] long double __floatdixf(di_int a);
[x] long double __floatditf(int64_t a); // ppc only
[x] float __floattisf(ti_int a);
[x] double __floattidf(ti_int a);
[ ] long double __floattixf(ti_int a);
[x] float __floatundisf(du_int a);
[x] double __floatundidf(du_int a);
[ ] long double __floatundixf(du_int a);
[x] long double __floatunditf(uint64_t a); // ppc only
[x] float __floatuntisf(tu_int a);
[x] double __floatuntidf(tu_int a);
[ ] long double __floatuntixf(tu_int a);

matu3ba commented 3 years ago

@andrewrk The signedness between libgcc 4.9 and current libgcc has changed for the 4.1.4 Bit operations from http://www.chiark.greenend.org.uk/doc/gcc-4.9-doc/gccint.html#index-_005f_005fclzsi2 (Bit operations). How should we deal with this?

libgcc says the following, and that is what the README in compiler_rt explicitly refers to:

Runtime Function: int __clzsi2 (unsigned int a)
Runtime Function: int __clzdi2 (unsigned long a)
Runtime Function: int __clzti2 (unsigned long long a)

    These functions return the number of leading 0-bits in a, starting at the most significant bit position. If a is zero, the result is undefined.

Here is the specification for this library:

http://gcc.gnu.org/onlinedocs/gccint/Libgcc.html#Libgcc
...
Here is a synopsis of the contents of this library:

typedef  int32_t si_int;
typedef uint32_t su_int;

typedef  int64_t di_int;
typedef uint64_t du_int;

However in the implementation they use (ie file clzsi2.c):

COMPILER_RT_ABI int __clzsi2(si_int a) {  //  < --------------THIS IS int32_t --------------
...

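The implementation of counting leading zeros also looks very odd, as

unsigned int v;  // input value, assumed nonzero
unsigned r = 0;  // ends up as the index of the highest set bit of v

while (v >>= 1) {
    r++;
}
// leading zeros of a 32-bit v = 31 - r

is much shorter (taken from here). Doesn't work with big endian though.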
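Wrapped into a function, the loop above counts how many times the value can be shifted right before it becomes zero, which gives the index of the highest set bit; the leading-zero count for a 32-bit input is then 31 minus that index. A minimal C sketch follows; sketch_clzsi2 is a hypothetical name, and it takes an unsigned argument as in the libgcc documentation quoted above.

```c
#include <stdint.h>

/* Hypothetical stand-in for __clzsi2: number of leading zero bits in a
 * nonzero 32-bit value. As in libgcc, the result for 0 is not meaningful. */
int32_t sketch_clzsi2(uint32_t v) {
    uint32_t r = 0;  /* index of the highest set bit */
    while (v >>= 1) {
        r++;
    }
    return 31 - (int32_t)r;
}
```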

zhaozg commented 3 years ago

Following this issue. I hit ld.lld: error: undefined symbol: __divdc3 when building https://github.com/facebookarchive/luaffifb with zig cc -target x86_64-linux-gnu:

[ 98%] Linking C executable luvi
ld.lld: error: undefined symbol: __muldc3
>>> referenced by ffi.c
>>>               CMakeFiles/ffi.dir/thirdparty/ffi/ffi.c.o:(check_complex_double) in archive luv.dir/lua/libffi.a
>>> referenced by ffi.c
>>>               CMakeFiles/ffi.dir/thirdparty/ffi/ffi.c.o:(check_complex_float) in archive luv.dir/lua/libffi.a
>>> referenced by ffi.c
>>>               CMakeFiles/ffi.dir/thirdparty/ffi/ffi.c.o:(set_value) in archive luv.dir/lua/libffi.a
>>> referenced 1 more times
>>> did you mean: __muldf3
>>> defined in: /Users/zhaozg/.cache/zig/o/6fbc72a7825d842d0abf7291021696e1/libcompiler_rt.a

ld.lld: error: undefined symbol: __divdc3
>>> referenced by ffi.c
>>>               CMakeFiles/ffi.dir/thirdparty/ffi/ffi.c.o:(cdata_div) in archive luv.dir/lua/libffi.a

matu3ba commented 3 years ago

If anyone besides me is working on this, you can use older releases (<80) of LLVM as hinted here ie release tag 80 builtins and release tag 80 unit test.

Alternatively https://bits.stephan-brumme.com/, http://aggregate.org/MAGIC/ and http://graphics.stanford.edu/~seander/bithacks.html#ParityParallel are good sources besides Hacker's Delight. Hacker's Delight also has accompanying code.

There is also this wiki https://www.chessprogramming.org/Bit-Twiddling with nice resources.

collection of some tricks, go bit twiddling API. fefes blog.

matu3ba commented 2 years ago

Also we really want to improve the compiler-rt libraries as they lead to very bad codegen.

andrewrk commented 2 years ago

Agreed! More optimized, well-tested compiler-rt implementations are welcome, and definitely within scope of the Zig project.

matu3ba commented 2 years ago

After manual inspection of the assembly generated for popcount, the CPU simulator shows me a ~5% performance penalty vs optimized assembly on x86_64 architectures, but for legal reasons I don't want to include the comparison here (128-bit popcount): link

__popcountti2:
        mov     rax, rsi
        shr     rax
        movabs  r8, 6148914691236517205
        and     rax, r8
        sub     rsi, rax
        movabs  rax, 3689348814741910323
        mov     rcx, rsi
        and     rcx, rax
        shr     rsi, 2
        and     rsi, rax
        add     rsi, rcx
        mov     rcx, rsi
        shr     rcx, 4
        add     rcx, rsi
        movabs  r9, 1085102592571150095
        and     rcx, r9
        movabs  rdx, 72340172838076673
        imul    rcx, rdx
        shr     rcx, 56
        mov     rsi, rdi
        shr     rsi
        and     rsi, r8
        sub     rdi, rsi
        mov     rsi, rdi
        and     rsi, rax
        shr     rdi, 2
        and     rdi, rax
        add     rdi, rsi
        mov     rax, rdi
        shr     rax, 4
        add     rax, rdi
        and     rax, r9
        imul    rax, rdx
        shr     rax, 56
        add     eax, ecx
        ret

Generated on llvm-mca: Instructions: 3700 vs 3800 with -mcpu=haswell and Total Cycles: 914 vs 1016. My remaining hypothesis is that codegen by LLVM is significantly worse for architectures without register renaming. It is also unclear whether and how power consumption deviates, and how well the CPU simulator works. The better one from icu still needs to be tested, but doesn't have a web interface.
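For reference, the four movabs constants in the listing are 0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F and 0x0101010101010101, which are the masks of the well-known parallel (SWAR) population count, applied once per 64-bit half. A minimal C sketch of that technique follows; the sketch_ names are hypothetical, and this is a reading of the assembly rather than the source it was compiled from.

```c
#include <stdint.h>

/* Parallel bit count of one 64-bit word. */
uint32_t sketch_popcount64(uint64_t x) {
    /* Count bits in each 2-bit group. */
    x = x - ((x >> 1) & 0x5555555555555555ull);
    /* Sum adjacent 2-bit counts into 4-bit groups. */
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    /* Sum adjacent 4-bit counts into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    /* Multiplying by 0x0101... adds all bytes into the top byte. */
    return (uint32_t)((x * 0x0101010101010101ull) >> 56);
}

/* Hypothetical stand-in for __popcountti2, taking the two halves of the
 * 128-bit value separately. */
uint32_t sketch_popcountti2(uint64_t lo, uint64_t hi) {
    return sketch_popcount64(lo) + sketch_popcount64(hi);
}
```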

matu3ba commented 2 years ago

~~The fastest routine to do multiplication overflow checks is via finite field arithmetic via Galois Linear Feedback Shift Register.~~

The best introduction and overview work (kinda like a goldmine) on Linear Feedback Shift Register techniques is "Linear Feedback Shift Registers for the Uninitiated, Part I: Ex-Pralite Monks and Finite Fields" by Jason Sachs, who has !16! very long introductions into various performance optimization techniques for integer stuff and likely also describes how to easily derive algorithms users may want for optimal performance in a related library.

As I understand compiler_rt, it is for embedded devices without hardware capabilities, and thus space optimized. Unless there is a hardware routine.

andrewrk commented 2 years ago

As I understand compiler_rt, it is for embedded devices without hardware capabilities, and thus space optimized.

For Zig it actually has multiple purposes which can be determined by inspecting @import("builtin").mode. When the programmer builds their application with -OReleaseSmall, then Zig builds compiler-rt with -OReleaseSmall; otherwise Zig builds compiler-rt with -OReleaseFast.

Our compiler-rt has the ability to choose different implementations depending on the desired mode.
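As a loose C analogue of choosing an implementation by build mode (not how Zig does it, which inspects @import("builtin").mode as described above), GCC and Clang predefine __OPTIMIZE_SIZE__ when compiling with -Os, and a routine can branch on that at compile time. The function name sketch_popcountsi2 is hypothetical, and the choice of which variant counts as "small" or "fast" is an assumption made for illustration.

```c
#include <stdint.h>

#if defined(__OPTIMIZE_SIZE__)
/* Size-oriented build: a compact loop that clears the lowest set bit
 * on each iteration. */
uint32_t sketch_popcountsi2(uint32_t a) {
    uint32_t n = 0;
    while (a != 0) {
        a &= a - 1;
        n++;
    }
    return n;
}
#else
/* Speed-oriented build: branch-free parallel bit count. */
uint32_t sketch_popcountsi2(uint32_t a) {
    a = a - ((a >> 1) & 0x55555555u);
    a = (a & 0x33333333u) + ((a >> 2) & 0x33333333u);
    a = (a + (a >> 4)) & 0x0F0F0F0Fu;
    return (a * 0x01010101u) >> 24;
}
#endif
```

Both variants return the same results, so callers do not need to know which one was compiled in.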

vladfaust commented 2 years ago

By the way, in https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html Andrew says:

Zig's compiler-rt is not yet complete. However, completing it is a prerequisite for releasing Zig version 1.0.0.

Shouldn't the 1.0.0 milestone be referencing this issue then?

matu3ba commented 2 years ago

@vladfaust The base stuff is implemented except the overflowing check primitives, which are semi-blocked by testing panics, ~~which in turn is blocked by an OS-independent socket abstraction~~: IPC to write the result from within the spawned process is the most sane/simple way for testing panics, which does not compromise panic handler design.

~~Testing architectures without OS-layer (without process isolation and IPC) would be deferred, as they may compromise panic handler design.~~ embedded devs are expected to use their own panic handler and modify _start etc.

The only thing I am dissatisfied with (as it's not ready yet) is an overflow multiplication improvement that does not rely on division, but I should be able to finish this within 1-2 days after staring long enough at the paper. Edit: the approach from the paper does not work. mulv and the other routines in Integral arithmetic with trapping overflow will be upstreamed after testing panics work:

__addvsi3
__addvdi3
__addvti3
__subvsi3
__subvdi3
__subvti3
__mulvsi3
__mulvdi3
__mulvti3

see #1356

Perf for mulo is ~17x faster than the LLVM implementation (measured on my Skylake laptop), which might make Zig's compiler_rt well worth using in external code. mulo bench. I did not measure the Zig internal one yet.

The perf gain from wrapping addition and subtraction over the simple approach is in the range of 10% for external code, and the Zig internal one without pointers will see ~15% improvement. addo benches.

So all in all, it's almost finished: https://github.com/ziglang/zig/issues/1290#issuecomment-819899869

compiler_rt version 2.0 would then be to use the hw accelerations for the routines for all tier 1 targets and figure out how to track this in a sane way.

matu3ba commented 2 years ago

Personally I would favor keeping compiler_rt small and readable and accepting 5-8% inefficiency (a very rough estimate extrapolated from the popcount speed difference) vs hand-rolled assembly. However, in case we want to tune it further or have a "gimme the experimental compiler_rt" switch, we can use:

https://github.com/nadavrot/memset_benchmark
TODO add more freely licensed optimizations

On the panic testing side, to get this finished, I plan to:

port and fix the missing signal handling
port posix_spawn to use instead of fork
move the pipe parts into pipe.zig with PipeImpl for comptime configuration of pipes

EDIT1: progress, but it needs benchmarking against realistic workloads to prevent regressions: https://github.com/ziglang/zig/pull/11701. The binding to libc posix_spawn needs debugging, the port of posix_spawn is incomplete, and porting signal handling is missing a strategy for Zig (Zig libc guarantees vs caller responsibility and to what degree, plus how to deal with inconsistency/complexity of syscall interaction).

andrewrk commented 1 year ago

Next step to solving this issue: we need an updated checklist of what is still missing.

matu3ba commented 1 year ago

@andrewrk See #13261 for the missing work items. addv,subv,mulv can be trivially copied, if you are okay without testing the panics. I think all tests can be trivially extended or derived from the other implementations except for a^b for complex numbers.

rjzak commented 1 year ago

I just tried to compile 0.10 for powerpc64le-linux-gnu, and found these were missing: __gcc_qmul, __gcc_qdiv, __gcc_qadd when at the last stage compiling zig-bootstrap.

[100%] Building stage3
LLD Link... ld.lld: error: undefined symbol: __gcc_qmul
>>> referenced by hashtable_c++0x.o:(std::__detail::_Prime_rehash_policy::_M_next_bkt(unsigned long) const) in archive /usr/lib64/libstdc++.a
>>> referenced by hashtable_c++0x.o:(std::__detail::_Prime_rehash_policy::_M_next_bkt(unsigned long) const) in archive /usr/lib64/libstdc++.a
>>> referenced by hashtable_c++0x.o:(std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const) in archive /usr/lib64/libstdc++.a

ld.lld: error: undefined symbol: __gcc_qdiv
>>> referenced by hashtable_c++0x.o:(std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const) in archive /usr/lib64/libstdc++.a

ld.lld: error: undefined symbol: __gcc_qadd
>>> referenced by hashtable_c++0x.o:(std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const) in archive /usr/lib64/libstdc++.a

Edit: resolved by using gcc 12.0 which I had compiled from source instead of using gcc 10.2 provided by Void Linux.

matu3ba commented 1 year ago

Suggestion to close this in favor of #15675.

The tracking issues can be searched via the following patterns: compiler_rt: Tracking Issue XYZ Routines.
