ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
33.49k stars 2.45k forks source link

musl update to 1.2.1 causes slowdown. #11298

Closed ghuls closed 2 weeks ago

ghuls commented 2 years ago

Zig Version

0.10.0-dev.750+cf5009f9a

Steps to Reproduce

# Clone Cluster-Buster repo.
git clone https://github.com/weng-lab/cluster-buster

cd cluster-buster

# Compile Cluster-Buster with different versions of zig.
for zig_version in 0.10.0-dev.750+cf5009f9a 0.6.0+e5fab3b 0.6.0+030f003 ; do
    for architecture in x86_64 ; do
        ../zig/zig-linux-x86_64-${zig_version}/zig c++ \
            -v -Wall \
            -std=c++0x \
            -O3 \
            -march="${architecture}" \
            -mtune="${architecture}" \
            -D NDEBUG \
            -target x86_64-linux-musl \
            -D GIT_COMMIT_INFO="\"2020-05-07 10:30:42 -0400  commit: ac1d33c\"" \
            -ffast-math \
            -o "cbust_fast_math_${architecture}_zig${zig_version//./}" \
             *.cpp
    done
done

Create input files:

# Create motif file.
echo '>motif1
0,0,100,0
0,100,0,0
0,0,0,100
0,50,50,0
0,100,0,0
0,50,0,50
0,0,50,50
0,100,0,0
0,50,0,50
0,0,100,0
0,50,0,50
33,33,0,34
' | tr ',' '\t' > /tmp/motif.cb

# Create FASTA file.
awk '
BEGIN {
    choose_nuc[0] = "A";
    choose_nuc[1] = "C";
    choose_nuc[2] = "G";
    choose_nuc[3] = "T";

    for (i = 1; i <= 250000; i++) {
        print ">seqeunce_" i;

        for (j = 0; j < 500; j++) {
            printf "%s", choose_nuc[int(rand() * 4)];
        }
        print;
    }
}' > /tmp/fasta.fa

Run Cluster-Buster binaries for each compiled version (with different zig versions) and check execution time.

time cbust_compiled_with_zig_version_x -f 4 /tmp/motif.cb /tmp/fasta.fa > /dev/null

Expected Behavior

I expect that newer zig versions are faster (or equal in speed) than older ones.

I had a compiled Cluster-Buster version that I compiled 2020-11-05 and this is the fastest version of Cluster-Buster I ever managed to compile. I didn't know with which version and settings ( it was compiled (march, tune) with) but I remembered it was a development version of zig at the time. I never managed to create a faster or same speed binary with any zig version afterwards (0.7.0 => current).

Today I tried to find which zig version it could have been and I managed to find it.

$ strings cbust_fastest | grep zig
/software/zig/zig-linux-x86_64-0.6.0+17837affd/lib/zig/libcxxabi/src/demangle/ItaniumDemangle.h
/software/zig/zig-linux-x86_64-0.6.0+17837affd/lib/zig/libcxxabi/src/cxa_demangle.cpp

I tracked down the commit that introduced the slowdown: https://github.com/ziglang/zig/commits/master?after=7f91be9c80f8d79e4a2c6cf0b15197cdd454cef6+6335&branch=master

Slow update musl sources to 1.2.1 @andrewrk andrewrk committed on Nov 6, 2020

update musl headers to 1.2.1 @andrewrk andrewrk committed on Nov 6, 2020

update process_headers tool @andrewrk andrewrk committed on Nov 6, 2020

As fast as zig 0.6.0+17837affd std: Introduce SemanticVersion data structure @jayschwa @andrewrk jayschwa authored and andrewrk committed on Nov 6, 2020

Something in the musl lib update seems to cause the slow down.

Actual Behavior

When running all binaries with perf, it seems like the musl update causes a huge number of pagefaults and the sytem time goes to 1.1 second vs to less than 0.1 for the fast binary (zig0.6.0+030f003).

$ perf stat ./cbust_fast_math_x86_64_zig0100-dev750+cf5009f9a -f 4 /tmp/motif.cb /tmp/fasta.fa > /dev/null

 Performance counter stats for './cbust_fast_math_x86_64_zig0100-dev750+cf5009f9a -f 4 m/tmp/motif.cb /tmp/fasta.fa':

          16152.22 msec task-clock:u              #    0.990 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            282135      page-faults:u             #    0.017 M/sec
       51491768159      cycles:u                  #    3.188 GHz
       78484700109      instructions:u            #    1.52  insn per cycle
       10115988739      branches:u                #  626.291 M/sec
         125795653      branch-misses:u           #    1.24% of all branches

      16.321717146 seconds time elapsed

      15.031987000 seconds user
       1.123085000 seconds sys

$ perf stat ./cbust_fast_math_x86_64_zig060+e5fab3b -f 4 /tmp/motif.cb /tmp/fasta.fa > /dev/null

 Performance counter stats for './cbust_fast_math_x86_64_zig060+e5fab3b -f 4 /tmp/motif.cb /tmp/fasta.fa':

          17711.34 msec task-clock:u              #    0.998 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            280616      page-faults:u             #    0.016 M/sec
       52003067748      cycles:u                  #    2.936 GHz
       85803272856      instructions:u            #    1.65  insn per cycle
       11490153668      branches:u                #  648.745 M/sec
         132034190      branch-misses:u           #    1.15% of all branches

      17.754947524 seconds time elapsed

      16.541077000 seconds user
       1.171173000 seconds sys

$ perf stat ./cbust_fast_math_x86_64_zig060+030f003 -f 4 /tmp/motif.cb /tmp/fasta.fa > /dev/null

 Performance counter stats for './cbust_fast_math_x86_64_zig060+030f003 -f 4 /tmp/motif.cb /tmp/fasta.fa':

          15094.25 msec task-clock:u              #    0.998 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             15851      page-faults:u             #    0.001 M/sec
       51491411258      cycles:u                  #    3.411 GHz
       85712078427      instructions:u            #    1.66  insn per cycle
       11470618728      branches:u                #  759.933 M/sec
         123801113      branch-misses:u           #    1.08% of all branches

      15.131651655 seconds time elapsed

      15.001611000 seconds user
       0.093772000 seconds sys
12101111 commented 2 years ago

musl introduced a new malloc implementation called mallocng in 1.2.0. Both the oldmalloc and mallocng are designed for embedded usage, and have bad performance for multi-thread computing.

You can build and link those malloc implementation as a replacement: https://github.com/microsoft/mimalloc https://github.com/microsoft/snmalloc https://github.com/mjansson/rpmalloc https://github.com/jemalloc/jemalloc

ghuls commented 2 years ago

Cluster-Buster is a single threaded program.

andrewrk commented 2 years ago

Here's the plan to address this: