zeek / spicy

C++ parser generator for dissecting protocols & files.
https://docs.zeek.org/projects/spicy
Other
249 stars 37 forks source link

Runtime differences single- vs multi-threaded host applications #1600

Closed awelzel closed 5 months ago

awelzel commented 11 months ago

This is again a micro-benchmark, but I think an interesting observation. Relates to zeek/zeek#3379.

When running spicy-driver for micro-benchmarking, glibc is running in single-threaded mode, avoiding usage of atomic instructions for std::shared_ptr. Within multi-threaded applications like Zeek, std::shared_ptr usage is more expensive. Patching spicy-driver.cc to start a very short-lived thread, thereby switching glibc into multi-threaded mode, the attached micro-benchmark runs 6% slower on my system due to what seems just use of atomic instructions for std::shared_ptr.

Recording the spicy-driver run with perf record --call-graph dwarf, the __gnu_cxx::__exchange_and_add function is reported with ~5.8% samples as hottest function. In a zeek -r test with the QUIC analyzer, it shows up with ~3% samples.

Not quite sure there's something that can be fixed unless removal of std::shared_ptr is on the table, but opening mostly as FYI.

(Testing was done with code from #1590)

$ hyperfine -w 1 'cat test-data.txt | spicy-driver nested.hlto' 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
Benchmark 1: cat test-data.txt | spicy-driver nested.hlto
  Time (mean ± σ):     11.364 s ±  0.067 s    [User: 11.343 s, System: 0.038 s]
  Range (min … max):   11.255 s … 11.463 s    10 runs

Benchmark 2: cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto
  Time (mean ± σ):     12.011 s ±  0.140 s    [User: 11.982 s, System: 0.036 s]
  Range (min … max):   11.885 s … 12.357 s    10 runs

Summary
  'cat test-data.txt | spicy-driver nested.hlto' ran
    1.06 ± 0.01 times faster than 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
module Test;

type J = unit {
  x: bytes &size=1;
};

type K = unit {
  j: J;
};

type L = unit {
  k1: K;
  k2: K;
  k3: K;
  k4: K;
};

type M = unit {
  l1: L;
  l2: L;
  l3: L;
  l4: L;
};

public type N = unit {
  : M[] &eod;
};

Patch to spicy-driver.cc:

--- a/spicy/toolchain/bin/spicy-driver.cc
+++ b/spicy/toolchain/bin/spicy-driver.cc
@@ -2,8 +2,11 @@

 #include <getopt.h>

+#include <cstdio>
+#include <cstdlib>
 #include <fstream>
 #include <iostream>
+#include <thread>

 #include <hilti/rt/libhilti.h>

@@ -289,6 +292,18 @@ int main(int argc, char** argv) {
                 if ( ! parser )
                     driver.fatalError(parser.error());

+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
+               if ( std::getenv("SPICY_THREAD") != nullptr ) {
+                       std::fprintf(stderr, "starting thread\n");
+                       std::thread t([]() {
+                               std::fprintf(stderr, "running in thread\n");
+                       });
+                       t.join();
+               }
+
+               std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
                 if ( auto x = driver.processInput(**parser, in, driver.opt_increment); ! x )
                     driver.fatalError(x.error());
             }
bbannier commented 11 months ago

Thanks for this issue @awelzel, this is a useful microbenchmarking result. We use std::shared_ptr extensively in the runtime library, e.g., as control blocks in views and our safe iterators which we copy heavily; we definitely want to avoid unneeded overhead there.

I ran your reproducer with Clang and libc++ and I observe no significant overhead, and we do not want overhead for GCC and libstdc++ either.

bbannier commented 9 months ago

Unassigning myself since ATM there is no clear path forward which doesn't involve a lot of work like implementing a standard libary-quality smart pointer library.

rsmmr commented 5 months ago

I don't see us changing the smart pointer any time soon, so closing because of the lack of a way forward.