Open awelzel opened 11 months ago
Your patched code definitely looks better, but the question is how hard it is to generate that code in general (i.e., the responsible section in the parser builder might create more state for other input; in modifications of the parser state via pushState
/popState
might make this hard).
Maybe a less intrusive fix would be to instead use an optional<tuple>
to reduce the cost from default-initializing __result
(even though you mentioned https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86173 offline). It should be possible to start it out as a default-constructed (unset) optional
, assign to it like to a tuple, and then at the end unconditionally deref it.
Like we discussed offline, one approach of not dealing with the different code paths would be to trade work on running the default constructors of the tuple elements for default initializing a tuple, e.g.,
From a36af2d6042247b7754d8f00981aebbb62972139 Mon Sep 17 00:00:00 2001
From: Benjamin Bannier <benjamin.bannier@corelight.com>
Date: Fri, 10 Nov 2023 12:40:52 +0100
Subject: [PATCH] Postphone allocating result tuple in generated code.
---
spicy/toolchain/src/compiler/codegen/parser-builder.cc | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/spicy/toolchain/src/compiler/codegen/parser-builder.cc b/spicy/toolchain/src/compiler/codegen/parser-builder.cc
index cfcf647e3..3aafaf311 100644
--- a/spicy/toolchain/src/compiler/codegen/parser-builder.cc
+++ b/spicy/toolchain/src/compiler/codegen/parser-builder.cc
@@ -298,7 +298,7 @@ struct ProductionVisitor
std::vector<Type> x = {type::stream::View(), look_ahead::Type, type::stream::Iterator(),
type::Optional(builder::typeByID("hilti::RecoverableFailure"))};
auto result_type = type::Tuple(std::move(x));
- auto store_result = builder()->addTmp("result", result_type);
+ auto store_result = builder()->addTmp("result", type::Optional(result_type));
auto try_ = begin_try();
@@ -360,7 +360,7 @@ struct ProductionVisitor
run_finally();
popState();
- builder()->addReturn(store_result);
+ builder()->addReturn(builder::deref(store_result));
return popBuilder()->block();
}; // End of build_parse_stage1()
--
2.42.1
In theory this could be much faster, but like is not (at least with GCC), https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86173.
I tested this in a huge internal analyzer we have and saw small throughput improvements when compiling with Clang, but unchanged or maybe even worse throughput with GCC.
Similar to #1592 (this is using the same netsted.spicy reproducer), patching the generated code to remove the default initialization of the std::tuple
__result
in the__parse
functions results in significant parsing performance improvement for GCC + ibstdc++ on Debian 12.Test2.cc removes
__result
from the parse stage2 functions and returnsstd::make_tuple
directly. This results in a 14% improvement and seems a correct optimization.Test22.cc additionally does the same in stage1 function. Though that short-cuts some of the
__error
assignments (see also #1592). The result is 1) over-promising and 2) probably also incorrect.Main point of the ticket: Default initialization of the std::tuple might not come for free and at least with GCC/libstdc++ it's significant overhead. This was found by running
perf annotate
on the parse functions with the hottest instruction beingrep stos
hinting at initialization of the tuple. Using godbolt and pinging Benjamin helped. Wrapping the std::tuple in a std::optional wouldn't help with GCC, as that seems to also always fully initialize an optional.I don't know the right fix, mostly just want to raise awareness of missed optimization opportunity.
cat ./test-data.txt \| spicy-driver Test.hlto
cat ./test-data.txt \| spicy-driver Test2.hlto
cat ./test-data.txt \| spicy-driver Test22.hlto
Test2.cc patch:
Test2.cc to Test22.cc patch: