ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
35.09k stars 2.56k forks source link

Super slow Big Int Implementation in Zig #10630

Closed derpycoder closed 2 years ago

derpycoder commented 2 years ago

Zig Version

0.10.0-dev.316+4938fb8f5

Steps to Reproduce

Run the following code using zig build-exe ./src/main.zig -O ReleaseFast --strip -lc:

const std = @import("std");
const Managed = std.math.big.int.Managed;

pub fn main() anyerror!void {
    const allocator = std.heap.c_allocator;

    var a = try Managed.initSet(allocator, 0);
    defer a.deinit();
    var b = try Managed.initSet(allocator, 1);
    defer b.deinit();
    var i: u128 = 0;

    while (i < 1000000) : (i += 1) {
        var c = try Managed.init(allocator);
        defer c.deinit();
        try c.add(a.toConst(), b.toConst());

        a.swap(&b);
        b.swap(&c);
    }

    const as = try a.toString(allocator, 10, std.fmt.Case.lower);
    defer allocator.free(as);
    std.log.info("Fib: {s}", .{as});
}

Expected Behavior

The program should execute and return result in less time than Golang or Rust.

Actual Behavior

Ziglang took: (On a Mac Mini 2020 M1)

   500,000 took   14.18 s.
 1,000,000 took   57.90 s.
...

Rust took: (On a slower computer)

   500,000 took   1.07 s.   (104,494 digits)
 1,000,000 took   3.97 s.   (208,988 digits)
 5,000,000 took 100.95 s. (1,044,938 digits) [1.6 Minutes]
10,000,000 took 406.98 s. (2,089,877 digits) [6.7 Minutes]

Golang took: (On a slower computer)

   500,000 took    3.62 s.   (104,494 digits)
 1,000,000 took   15.11 s.   (208,988 digits)
 5,000,000 took  386.85 s. (1,044,938 digits)  [6.4 Minutes]
10,000,000 took 1896.96 s. (2,089,876 digits) [31.6 Minutes] 😅

Also memory usage was upto 5.4 mb for n = 1 million, Rust kept the memory usage under 3 mb even for 10 million!!

Links: Rust & Golang Benchmark StackOverflow Issue

tiehuis commented 2 years ago

The big integer routines aren't too optimized so there may be some algorithmic differences which are affecting this but a large issue is an inefficient toString implementation.

Take the following modified program:

const std = @import("std");
const Managed = std.math.big.int.Managed;

pub fn main() anyerror!void {
    const allocator = std.heap.c_allocator;

    var a = try Managed.initSet(allocator, 0);
    defer a.deinit();
    var b = try Managed.initSet(allocator, 1);
    defer b.deinit();
    var i: u128 = 0;

    while (i < 500000) : (i += 1) {
        var c = try Managed.init(allocator);
        defer c.deinit();
        try c.add(a.toConst(), b.toConst());

        a.swap(&b);
        b.swap(&c);
    }

    const as = try a.toString(allocator, 10, std.fmt.Case.lower);
    defer allocator.free(as);
    std.log.info("Fib: {s}", .{as});

    std.debug.print("number of limbs = {}\n", .{a.limbs.len});
}
 $ time ./t
number of limbs = 5425

________________________________________________________
Executed in    5.83 secs    fish           external
   usr time    5.70 secs  663.00 micros    5.70 secs
   sys time    0.01 secs   88.00 micros    0.01 secs

Removing the toString and log results in the following for 500,000.

$ time ./t
number of limbs = 5425

________________________________________________________
Executed in    2.62 secs    fish           external
   usr time    2.56 secs  631.00 micros    2.56 secs
   sys time    0.01 secs   81.00 micros    0.01 secs

This looks to increase proprotionally. For reference I got the following times with the conversion to string removed.

1,000,000 10.41s
2,500,000 97.48s

Which looks much more in line with the Golang performance scaling. One would need to investigate toString more in detail. The algorithm used is fairly trivial for base-10 and requires a lot of divisions in the current state. There may be a smarter algorithm to convert to a string for this or some assumptions we can make note of to improve this.

derpycoder commented 2 years ago

I am running the same code again, with and without the toString part. And speed is completely different from what I got in the morning. (I think, I might have messed up something to get wrong data.)

As of now, for 1,000,000: With toString, I got 55.4 s, and without toString I got 48 s.

Removing toString didn't improve the performance by that much, so please don't overlook that perhaps the whole implementation has inefficiency strewn into it.

tiehuis commented 2 years ago

I did some more testing and I'm fairly certain this is related solely to the toString implementation.

$ zig build-exe t.zig -O ReleaseFast -lc
$ perf record ./t
$ perf display

Cycle timings are:

Samples: 774K of event 'cycles:u', Event count (approx.): 680188163038
Overhead  Command  Shared Object     Symbol
  53.28%  t        t                 [.] compiler_rt.udivmod.udivmod                                                                                                                                                                                                                                                          ▒
  45.50%  t        t                 [.] std.math.big.int.Mutable.addCarry                                                                                                                                                                                                                                                    ▒
   0.75%  t        t                 [.] main                                                                                                                                                                                                                                                                                 ▒
   0.08%  t        t                 [.] __udivti3                                                                                                                                                                                                                                                                            ▒
   0.08%  t        [unknown]         [k] 0xffffffff92e010a7                                                                                                                                                                                                                                                                   ▒
   0.06%  t        libc-2.33.so      [.] __memmove_avx_unaligned_erms                                                                                                                                                                                                                                                         ▒
   0.06%  t        libc-2.33.so      [.] cfree@GLIBC_2.2.5                                                                                                                                                                                                                                                                    ▒
   0.05%  t        libc-2.33.so      [.] _int_free                                                                                                                                                                                                                                                                            ▒
   0.04%  t        libc-2.33.so      [.] malloc                                                                                                                                                                                                                                                                               ▒
   0.03%  t        libc-2.33.so      [.] _int_malloc                                                                                                                                                                                                                                                                          ▒
   0.01%  t        libc-2.33.so      [.] _mid_memalign                                                                                                                                                                                                                                                                        ▒
   0.01%  t        t                 [.] std.math.big.int.Managed.ensureCapacity                                                                                                                                                                                                                                              ▒
   0.01%  t        t                 [.] 0x00000000000043a0                                                                                                                                                                                                                                                                   ▒
   0.01%  t        t                 [.] std.heap.CAllocator.alloc                                                                                                                                                                                                                                                            ▒
   0.01%  t        libc-2.33.so      [.] __posix_memalign                                                                                                                                                                                                                                                                     ▒
   0.01%  t        t                 [.] std.heap.CAllocator.free                                                                                                                                                                                                                                                             ▒
   0.00%  t        t                 [.] 0x00000000000043e0                                                                                                                                                                                                                                                                   ▒
   0.00%  t        libc-2.33.so      [.] __malloc_usable_size                                                                                                                                                                                                                                                                 ▒
   0.00%  t        t                 [.] std.heap.CAllocator.resize                                                                                                                                                                                                                                                           ▒
   0.00%  t        libc-2.33.so      [.] __brk                                                                                                                                                                                                                                                                                ▒
   0.00%  t        libc-2.33.so      [.] systrim.constprop.0                                                                                                                                                                                                                                                                  ▒
   0.00%  t        libc-2.33.so      [.] sysmalloc                                                                                                                                                                                                                                                                            ▒
   0.00%  t        t                 [.] 0x0000000000004380                                                                                                                                                                                                                                                                   ▒
   0.00%  t        libc-2.33.so      [.] __sbrk                                                                                                                                                                                                                                                                               ▒
   0.00%  t        libc-2.33.so      [.] alloc_perturb                                                                                                                                                                                                                                                                        ▒
   0.00%  t        t                 [.] 0x0000000000004390                                                                                                                                                                                                                                                                   ▒
   0.00%  t        libc-2.33.so      [.] unlink_chunk.constprop.0                                                                                                                                                                                                                                                             ▒
   0.00%  t        [unknown]         [k] 0xffffffff92e00158                                                                                                                                                                                                                                                                   ▒
   0.00%  t        ld-2.33.so        [.] _dl_lookup_symbol_x                                                                                                                                                                                                                                                                  ▒
   0.00%  t        ld-2.33.so        [.] handle_intel.constprop.0                                                                                                                                                                                                                                                             ▒
   0.00%  t        ld-2.33.so        [.] _dl_start                                                                                                                                                                                                                                                                            ▒

The cycle percentages are roughly the same, even as the input parameters increase (500,000 to 2,500,000).

The udivmod is a compiler-rt division that is generated when doing the single-limb divisions to generate decimal output. The generated output looks a bit poor but the root issue is we don't want to emit a udivmod and should prefer better assembly generated instead. There are ways to do this without requiring the division, see https://gmplib.org/manual/Single-Limb-Division for examples. A simple method we could use that would speed up this case would be to add the concept of a half-limb. We can specialize single-limb division into the current (full limb) and a new half-limb routine which could operate entirely on a usize limb and not have to emit the udivmod instruction.

The relevant code is https://github.com/ziglang/zig/blob/538c9e7bafd859148352e76c1d8053e1ba426a06/lib/std/math/big/int.zig#L3185. Specifically the else case where the mod is performed.

I would suggest adding a new low-level routine (e.g. lldivhalf), a new HalfLimb type, and modifying the higher level div call at https://github.com/ziglang/zig/blob/538c9e7bafd859148352e76c1d8053e1ba426a06/lib/std/math/big/int.zig#L1338 to check if the divisor fits in a half-limb. I expect this should reduce this runtime a lot but should be implemented to confirm.

If you don't think this is correct or have information that suggests otherwise can you do an in-depth profile (e.g. with perf) to display the results you are getting.

tiehuis commented 2 years ago

And just to clarify on your initial expectations:

| The program should execute and return result in less time than Golang or Rust.

The performance of a big integer library isn't too related to the language, especially as the number sizes grow. The main factor that determines performance are the algorithms and core routines which are called repeatedly. These are often written in assembly for speed. I would expect us to be at least in line with the Golang implementation but don't expect faster performance compared to another library unless the methods used are identical (or have been optimized).

derpycoder commented 2 years ago

I realized that my micro benchmarking strategy was flawed, because of this.

I won't judge Ziglang based on it's internal library. I found out there are Rust libs with even faster results than the one I tested. So it came down to library and its implementation, which Ziglang will eventually have.


My gripes with it, are:

  1. Installation gave me a bit of a pain. (Both Zig & ZLS install)
  2. I tried using Ziglang for game dev, but the package installation was so arcane. I had to download SDL with homebrew, but the project wasn't able to find it.
  3. There's no package manager, I hated having to manually install libraries.
  4. Sometimes, the compiler was not giving detailed errors of what I had done wrong.
  5. Missing documentation of Big Int, so I had to go through the Source Code.

But the best part of my experience with Ziglang was:

  1. The speed with which compilation happens with larger projects I tried to run.
  2. The fact that, I was able to use C libs without a hitch.
  3. More surprisingly I was able to write something functional without knowing everything about the language.
  4. Language is super simple.

I searched for Ziglang, because of how verbose or flawed everything else is. I want to make games with it. I will wait for it to mature. Looking forward to fast compiler!!

I love Zig. ❤️

ominitay commented 2 years ago

@abhijit-kar Just fyi, there are presently two package managers for Zig: Zigmod, and Gyro. There will be an official one in future. You might have better luck with SDL by using SDL.zig too

Ev1lT3rm1nal commented 9 months ago

I still find big int very slow