The big integer routines aren't heavily optimized, so there may be some algorithmic differences affecting this, but a large part of the issue is an inefficient toString implementation.
Take the following modified program:
const std = @import("std");
const Managed = std.math.big.int.Managed;

pub fn main() anyerror!void {
    const allocator = std.heap.c_allocator;

    var a = try Managed.initSet(allocator, 0);
    defer a.deinit();
    var b = try Managed.initSet(allocator, 1);
    defer b.deinit();

    var i: u128 = 0;
    while (i < 500000) : (i += 1) {
        var c = try Managed.init(allocator);
        defer c.deinit();
        try c.add(a.toConst(), b.toConst());
        a.swap(&b);
        b.swap(&c);
    }

    const as = try a.toString(allocator, 10, std.fmt.Case.lower);
    defer allocator.free(as);
    std.log.info("Fib: {s}", .{as});
    std.debug.print("number of limbs = {}\n", .{a.limbs.len});
}
$ time ./t
number of limbs = 5425
________________________________________________________
Executed in    5.83 secs      fish           external
   usr time    5.70 secs    663.00 micros    5.70 secs
   sys time    0.01 secs     88.00 micros    0.01 secs
Removing the toString and log results in the following for 500,000.
$ time ./t
number of limbs = 5425
________________________________________________________
Executed in    2.62 secs      fish           external
   usr time    2.56 secs    631.00 micros    2.56 secs
   sys time    0.01 secs     81.00 micros    0.01 secs
This appears to scale proportionally. For reference, I got the following times with the conversion to string removed.
1,000,000 10.41s
2,500,000 97.48s
Which looks much more in line with the Golang performance scaling. One would need to investigate toString in more detail. The algorithm used is fairly trivial for base-10 and requires a lot of divisions in its current state. There may be a smarter algorithm for converting to a string, or some assumptions we can exploit to improve this.
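To make concrete why the schoolbook approach is so division-heavy, here is a rough sketch of the inner loop of such a conversion. This is illustrative only, not the actual std.math.big.int code; the name divRemByLimb, the 10^19 chunking, and the exact signature are my own, and the casts use the 0.10-era builtin syntax.

const Limb = usize; // 64 bits on the targets discussed here

// Largest power of 10 that fits in a 64-bit limb: one pass peels off 19
// decimal digits.
const pow10_per_limb: Limb = 10_000_000_000_000_000_000;

// Divide the little-endian limbs in place by `d` and return the remainder.
// The dividend at each step is 128 bits wide, so every iteration performs a
// 128-by-64-bit division -- which is where compiler_rt's udivmod shows up in
// the profile further down in this thread.
fn divRemByLimb(limbs: []Limb, d: Limb) Limb {
    var rem: u128 = 0;
    var i: usize = limbs.len;
    while (i > 0) {
        i -= 1;
        const n = (rem << 64) | limbs[i];
        limbs[i] = @intCast(Limb, n / d); // 0.10-era cast syntax
        rem = n % d;
    }
    return @intCast(Limb, rem);
}

// The schoolbook conversion calls this repeatedly (collecting 19 digits per
// call) until the number reaches zero, so an n-limb value costs roughly n
// passes over n limbs: O(n^2) limb divisions.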
I ran the same code again, with and without the toString part, and the speed is completely different from what I got in the morning. (I think I might have messed something up and got bad data.)
As of now, for 1,000,000: with toString I got 55.4 s, and without toString I got 48 s.
Removing toString didn't improve the performance by that much, so please don't overlook the possibility that the whole implementation has inefficiencies strewn through it.
I did some more testing and I'm fairly certain this is related solely to the toString implementation.
$ zig build-exe t.zig -O ReleaseFast -lc
$ perf record ./t
$ perf report
Cycle timings are:
Samples: 774K of event 'cycles:u', Event count (approx.): 680188163038
Overhead  Command  Shared Object  Symbol
  53.28%  t        t              [.] compiler_rt.udivmod.udivmod
  45.50%  t        t              [.] std.math.big.int.Mutable.addCarry
   0.75%  t        t              [.] main
   0.08%  t        t              [.] __udivti3
   0.08%  t        [unknown]      [k] 0xffffffff92e010a7
   0.06%  t        libc-2.33.so   [.] __memmove_avx_unaligned_erms
   0.06%  t        libc-2.33.so   [.] cfree@GLIBC_2.2.5
   0.05%  t        libc-2.33.so   [.] _int_free
   0.04%  t        libc-2.33.so   [.] malloc
   0.03%  t        libc-2.33.so   [.] _int_malloc
   0.01%  t        libc-2.33.so   [.] _mid_memalign
   0.01%  t        t              [.] std.math.big.int.Managed.ensureCapacity
   0.01%  t        t              [.] 0x00000000000043a0
   0.01%  t        t              [.] std.heap.CAllocator.alloc
   0.01%  t        libc-2.33.so   [.] __posix_memalign
   0.01%  t        t              [.] std.heap.CAllocator.free
   0.00%  t        t              [.] 0x00000000000043e0
   0.00%  t        libc-2.33.so   [.] __malloc_usable_size
   0.00%  t        t              [.] std.heap.CAllocator.resize
   0.00%  t        libc-2.33.so   [.] __brk
   0.00%  t        libc-2.33.so   [.] systrim.constprop.0
   0.00%  t        libc-2.33.so   [.] sysmalloc
   0.00%  t        t              [.] 0x0000000000004380
   0.00%  t        libc-2.33.so   [.] __sbrk
   0.00%  t        libc-2.33.so   [.] alloc_perturb
   0.00%  t        t              [.] 0x0000000000004390
   0.00%  t        libc-2.33.so   [.] unlink_chunk.constprop.0
   0.00%  t        [unknown]      [k] 0xffffffff92e00158
   0.00%  t        ld-2.33.so     [.] _dl_lookup_symbol_x
   0.00%  t        ld-2.33.so     [.] handle_intel.constprop.0
   0.00%  t        ld-2.33.so     [.] _dl_start
The cycle percentages stay roughly the same even as the input size increases (500,000 to 2,500,000).
The udivmod is a compiler-rt division routine that gets generated for the single-limb divisions used to produce the decimal output. The generated code looks a bit poor, but the root issue is that we don't want to emit a udivmod call at all and should get better assembly generated instead. There are ways to do this without requiring the wide division; see https://gmplib.org/manual/Single-Limb-Division for examples. A simple method that would speed up this case is to add the concept of a half-limb: we can specialize single-limb division into the current (full-limb) routine and a new half-limb routine which operates entirely on a usize limb and never has to emit the udivmod call.
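Here is a rough sketch of what such a half-limb routine (call it lldivhalf) could look like. Nothing below exists in std today; the name, the HalfLimb type, and the signature are all made up for illustration.

const std = @import("std");

const Limb = usize; // 64 bits on the targets discussed here
const HalfLimb = u32; // hypothetical half-limb type
const half_bits = @bitSizeOf(HalfLimb);

// Hypothetical lldivhalf: divide the little-endian limbs `a` by a divisor
// that fits in a half-limb, writing the quotient into `q` and returning the
// remainder. Because both the divisor and the running remainder fit in half
// a limb, every partial dividend fits in a single usize and each step is a
// native 64-by-64-bit division -- no call into compiler_rt's udivmod.
fn lldivhalf(q: []Limb, a: []const Limb, d: HalfLimb) Limb {
    std.debug.assert(d != 0);
    std.debug.assert(q.len >= a.len);
    var rem: Limb = 0;
    var i: usize = a.len;
    while (i > 0) {
        i -= 1;
        const hi = a[i] >> half_bits;
        const lo = a[i] & std.math.maxInt(HalfLimb);

        // rem < d and hi < 2^32, so this dividend fits in one usize.
        const n1 = (rem << half_bits) | hi;
        const q1 = n1 / d;
        rem = n1 % d;

        const n0 = (rem << half_bits) | lo;
        const q0 = n0 / d;
        rem = n0 % d;

        q[i] = (q1 << half_bits) | q0;
    }
    return rem;
}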
The relevant code is https://github.com/ziglang/zig/blob/538c9e7bafd859148352e76c1d8053e1ba426a06/lib/std/math/big/int.zig#L3185, specifically the else case where the mod is performed.
I would suggest adding a new low-level routine (e.g. lldivhalf), a new HalfLimb type, and modifying the higher-level div call at https://github.com/ziglang/zig/blob/538c9e7bafd859148352e76c1d8053e1ba426a06/lib/std/math/big/int.zig#L1338 to check whether the divisor fits in a half-limb. I expect this would reduce the runtime a lot, but it needs to be implemented to confirm.
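Gluing the two sketches above together, the check could look roughly like this. Again purely illustrative: divRemSingleLimb is a placeholder, and divRemByLimb from the earlier sketch stands in for the existing full-limb path.

// Hypothetical higher-level single-limb division: take the half-limb fast
// path when the divisor allows it, otherwise fall back to the current
// full-limb path that emits one udivmod per limb.
fn divRemSingleLimb(q: []Limb, a: []const Limb, d: Limb) Limb {
    std.debug.assert(q.len >= a.len);
    if (d <= std.math.maxInt(HalfLimb)) {
        // Fast path: runs entirely on native usize divisions.
        return lldivhalf(q, a, @intCast(HalfLimb, d)); // 0.10-era cast syntax
    }
    // Slow path: copy, then divide in place with the u128-per-limb routine.
    std.mem.copy(Limb, q[0..a.len], a);
    return divRemByLimb(q[0..a.len], d);
}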
If you don't think this is correct, or have information that suggests otherwise, can you do an in-depth profile (e.g. with perf) and share the results you are getting?
And just to clarify, regarding your initial expectations:
| The program should execute and return the result in less time than Golang or Rust.
The performance of a big integer library isn't strongly tied to the language, especially as the numbers grow. The main factors that determine performance are the algorithms and the core routines that are called repeatedly; these are often written in assembly for speed. I would expect us to be at least in line with the Golang implementation, but I wouldn't expect faster performance compared to another library unless the methods used are identical (or have been specifically optimized).
I realized that my micro-benchmarking strategy was flawed because of this.
I won't judge Ziglang based on its internal library. I found out there are Rust libs with even faster results than the one I tested. So it comes down to the library and its implementation, which Ziglang will eventually have.
My gripes with it are:
But the best part of my experience with Ziglang was:
I searched for Ziglang because of how verbose or flawed everything else is. I want to make games with it, and I will wait for it to mature. Looking forward to the fast compiler!!
I love Zig. ❤️
@abhijit-kar Just FYI, there are presently two package managers for Zig: Zigmod and Gyro. There will be an official one in the future. You might have better luck with SDL by using SDL.zig, too.
I still find big int very slow
Zig Version
0.10.0-dev.316+4938fb8f5
Steps to Reproduce
Run the following code using zig build-exe ./src/main.zig -O ReleaseFast --strip -lc:
Expected Behavior
The program should execute and return the result in less time than Golang or Rust.
Actual Behavior
Ziglang took: (On a Mac Mini 2020 M1)
Rust took: (On a slower computer)
Golang took: (On a slower computer)
Also, memory usage was up to 5.4 MB for n = 1 million; Rust kept the memory usage under 3 MB even for 10 million!!
Links: Rust & Golang Benchmark StackOverflow Issue