zwegner / faster-utf8-validator

A very fast library for validating UTF-8 using AVX2/SSE4 instructions

Investigate adaptive approaches #3

Open · zwegner opened 4 years ago

zwegner commented 4 years ago

BeeOnRope brought up some good points in this subthread of my HN post: https://news.ycombinator.com/item?id=21550453

I can see two main approaches to adaptation: deciding whether to use an early exit for ASCII-only input, and computing the continuation mask for 4-byte (and maybe 3-byte?) sequences lazily, only if the normal checks fail. Here are some quick notes on both of these approaches, and then some general notes.

ASCII approach: The main idea is to take away the misprediction penalty that happens when input has long strings of only-ASCII or only-non-ASCII, and hoist the unpredictable branch to an outer loop of N vectors of input. The unpredictable branch would be based on a counter, updated in both the fast-ASCII-exit and branchless paths. This could be made very cheap. Where we have in the current code:

    if (!(high | *last_cont))
        return 1;

This would turn into something like this:

    vmask_t cond = !(high | *last_cont);
    *counter += cond;
    if (BRANCHY && cond)
        return 1;

...where BRANCHY is a macro/specialization that determines whether we're compiling the branchless version or not. If the counter indicates X out of N vectors in the outer loop were ASCII-only, we should use the fast-ASCII-path version in the next chunk.
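
Roughly, I'm picturing the outer loop looking something like the sketch below. To be clear, this is just a sketch: the chunk helpers, V_LEN, CHUNK_VECTORS, and ASCII_THRESHOLD are made-up names, not anything in the current code; each chunk helper would validate CHUNK_VECTORS vectors, bump *ascii_count once per ASCII-only vector, and return 0 on invalid input.

    #include <stddef.h>
    #include <stdint.h>

    #define V_LEN           32  /* AVX2 vector width */
    #define CHUNK_VECTORS   8   /* the "N" in the outer loop */
    #define ASCII_THRESHOLD 6   /* the "X out of N" cutoff */

    typedef uint32_t vmask_t;   /* stand-in for the library's mask type */

    int z_validate_chunk_branchy(const uint8_t *data, vmask_t *last_cont,
                                 uint32_t *ascii_count);
    int z_validate_chunk_branchless(const uint8_t *data, vmask_t *last_cont,
                                    uint32_t *ascii_count);

    int z_validate_adaptive(const uint8_t *data, size_t len)
    {
        vmask_t last_cont = 0;
        int use_ascii_exit = 1;                 /* start optimistic */
        while (len >= CHUNK_VECTORS * V_LEN) {
            uint32_t ascii_count = 0;
            int ok = use_ascii_exit
                ? z_validate_chunk_branchy(data, &last_cont, &ascii_count)
                : z_validate_chunk_branchless(data, &last_cont, &ascii_count);
            if (!ok)
                return 0;
            /* One predictable branch per N vectors picks the next chunk's path */
            use_ascii_exit = (ascii_count >= ASCII_THRESHOLD);
            data += CHUNK_VECTORS * V_LEN;
            len  -= CHUNK_VECTORS * V_LEN;
        }
        /* ... validate the sub-chunk remainder with the small, rolled-up loop ... */
        return 1;
    }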

4-byte approach: I'll just copy+paste part of one of my posts: because my algorithm uses lookup tables for error flags which have some free bits, and because the 4 byte sequences can be detected in the same indices that are used for these lookups, we can set another error bit that means "take the 4-byte slow path". Then, when some input fails validation, we only do the work there to check whether it's really a failure. This gets complicated, though: first off, the check for the proper number of continuation bytes is before the table lookup, so we'd need to put some logic in there. Secondly, this lookup table gets validation failures one byte later in the stream than the initial byte. So in the case that the 4-byte sequence starts on the last byte of a vector, we'd need to have special handling.
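
In rough pseudo-C, the lazy path would look something like the fragment below. Again, just a sketch: fast_checks and recheck_exact are made-up names, and this glosses over the continuation-count and vector-boundary wrinkles described above.

    // Sketch only: the fast path folds a "maybe 4-byte sequence" flag into
    // the same error mask it already computes, so the common case stays
    // branch-free. Only a nonzero mask pays for an exact recheck.
    vmask_t err = fast_checks(vec, &cont_mask);
    if (err) {
        // Could be a real error or just a flagged 4-byte lead byte: redo the
        // checks for this vector the slow, exact way (scalar is fine here)
        if (!recheck_exact(data, offset))
            return 0;
    }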

General issues:

If anyone has any thoughts on these issues, particularly anyone who plans to use this code in a real project, I'd like to hear from you.

travisdowns commented 4 years ago

Some general thoughts:

I agree with your concern over codesize. This is generally a difficult consideration to fully integrate into a benchmark, as unless your code is grossly obese even large methods will fit well enough in the innermost cache levels. It only shows up in the "real world", and confusingly your individual method might still time as better, but then the overall application slows down, because much of the penalty is taken outside your method.

Still, if you want to gold-plate this, you can accommodate it: you can put some reasonable upper bound on the cost of an icache miss (e.g., say it causes a stall of roughly a miss to memory, 100 ns or so), and you also know how much improvement increasing the code size gives you, "per byte" - so you could calculate a crossover point, e.g., if the input is longer than X, it's worth using the extra code size, because the misses will always be paid off.
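
To make that concrete, the whole calculation fits in a few lines. Every number below is a placeholder assumption you'd swap for your own estimates, not a measured constant:

    #include <stdio.h>

    int main(void)
    {
        /* all three inputs are rough assumptions you plug in */
        double extra_code_bytes   = 500.0;  /* how much bigger the fast version is */
        double miss_cost_ns       = 100.0;  /* assumed worst-case cost per icache miss */
        double saving_ns_per_byte = 0.05;   /* measured speedup of the big version */

        /* each extra line can miss twice: once coming in, once more when the
         * application line it evicted has to come back */
        double worst_case_miss_ns = (extra_code_bytes / 64.0) * 2.0 * miss_cost_ns;
        double crossover_bytes    = worst_case_miss_ns / saving_ns_per_byte;

        printf("use the big version for inputs longer than ~%.0f bytes\n",
               crossover_bytes);
        return 0;
    }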

Usually I don't go that far (actually calculating the crossover), but I just try to build in code size awareness in a reasonable way: e.g., ensure the "small input" case doesn't use many cache lines and then only go to the big unrolled loop after some number of iterations - this often dovetails nicely with wanting to have special-case "small input" code anyway, where constant factors and intro/outro handling matter relatively more.

travisdowns commented 4 years ago

BTW, if you want to run an AVX-512 algorithm, I'm happy to run it for you (if you make it easy, e.g., clone this project and run these benchmark scripts).

That said, you can also rent low-core EC2 instances with AVX-512 for something like 5 cents an hour, with pretty reliable performance. You can log on and run your benchmark scripts and stop the instance and it will probably be 1 cent in total or whatever.

travisdowns commented 4 years ago

One project that I know is interested in this kind of fast UTF-8 validation is simdjson and they have an outstanding PR with a similar algorithm (at least per the basic strategy of looking up the nibbles).

Their use case is slightly different from the benchmark in the sense that the UTF8 validation is, I think, interleaved fine-grained with other parts of the json parsing, rather than done as a block apart from the parsing, so the considerations are slightly different (e.g., port pressure is different, there might be more register pressure, and some latencies that might matter in a pure benchmark might be hidden).

zwegner commented 4 years ago

Ohhh, hey Travis. Crazy, I was recently reading through your fixed-regex-backreferences-in-P repo, which was a response to @geofflangdale, who I think asked that after we were discussing that issue in an email thread. Small world. If there isn't some kind of SIMD-nerds forum/mailing list for talking about this sort of stuff, there really should be...

Not a bad idea to try to create a model for I$ misses. I'd generally expect things like this to be quite chaotic in real systems, though, when you start getting into code layout, set associativity, etc. But your idea of keeping the quick version small and un-unrolled is a good step. It would match nicely with the branch hoisting adaptiveness here, too: the main loop would get grouped into larger chunks that decide for several vectors' worth of input whether to use the ASCII early exit or not, and all of those are unrolled inside the two paths. The remainder of the input vectors that don't fit into a chunk would be rolled up and not exert too much I$ pressure.

As an aside, I wish there was some good, fast, open-source simulator technology. Use public information from Agner Fog/wikichip/whatever to build detailed microarchitectural models, and add instrumentation to get perf counters on steroids, or build visualizations to see instructions flow through the pipeline for important kernels. Add JIT-ing to make it fast enough to use on large programs without too many orders of magnitude slowdown. That'd sure be useful...

zwegner commented 4 years ago

Their use case is slightly different from the benchmark in the sense that the UTF8 validation is, I think, interleaved fine-grained with other parts of the json parsing, rather than done as a block apart from the parsing, so the considerations are slightly different (e.g., port pressure is different, there might be more register pressure, and some latencies that might matter in a pure benchmark might be hidden).

That's what I thought might be happening at first, too, but they were pretty smart about it: the entire input is UTF-8-validated in 128-byte chunks, while the other JSON parsing happens in parallel (in a superscalar sense). See: https://github.com/lemire/simdjson/blob/master/src/generic/stage1_find_marks.h#L234-L268

travisdowns commented 4 years ago

Not a bad idea to try to create a model for I$ misses. I'd generally expect things like this to be quite chaotic in real systems, though, when you start getting into code layout, set associativity, etc.

Definitely, but that's why I say "upper bound" - i.e., in the worst case we just assume that every i-cache line misses, so if your code takes 640 bytes (10 lines) then that's 10 misses. Of course, that's not truly an upper bound, since worse things could happen (e.g., itlb misses and page walks) - but it's a pretty realistic upper bound for most cases I think. Then you can at least build a version that "does no harm" in the sense that it only uses the big code when it knows it benefits even with upper bound figures. In reality, you'd probably want to set the cutover lower since you don't really want to target the worst case - so this is at least partly a framework for thinking about instruction size versus runtime performance. Sometimes it can just stop an argument in its tracks, e.g., because someone says "but what about icache misses" and you can show that even with, say, 20 icache misses, since you know your input is always 10 MB, the saving always overwhelms it.

For a library where you don't have a known input size distribution it evidently gets more complicated!

As an aside, I wish there was some good, fast, open-source simulator technology. Use public information from Agner Fog/wikichip/whatever to build detailed microarchitectural models, and add instrumentation to get perf counters on steroids, or build visualizations to see instructions flow through the pipeline for important kernels. Add JIT-ing to make it fast enough to use on large programs without too many orders of magnitude slowdown. That'd sure be useful...

You are not alone. Have you looked at Gem5 or snipersim? They are somewhere between not-terrible and not-great. They are probably not useful for fine-grained tweaking like moving some instructions around or replacing a shuffle with two blends or whatever, but for larger changes they can be helpful.

I use performance counters a lot and with small controlled tests they can be very reliable, e.g., down to 0.1% for most, or better if you take even more precautions. There are some undocumented counters that I have been finding useful too; I'll write about them maybe. This has the advantage of exactly matching the hardware (after all, it is the hardware), but noise and limitations like only having up to N counters at once are annoying.

travisdowns commented 4 years ago

That's what I thought might be happening at first, too, but they were pretty smart about it: the entire input is UTF-8-validated in 128-byte chunks, while the other JSON parsing happens in parallel (in a superscalar sense).

Sure, but this still falls well within fine grained for me! As in, interleaved at the 10s or even low 100s of cycles level, meaning that there will be a lot of overlap between the instructions from both blocks, and so they can't be considered independent but rather have to be treated as a system. Coarse-grained interleaving would mean something like several KiB or more of input, so that the functions mostly don't overlap in an OoO sense, so they can be treated as two separate black boxes.

Indeed, the simdjson guys often talk about changes where some change was faster/slower in isolation (e.g., utf8 parsing only), but when interleaved, the timing reversed. So for them, they really have to test in-situ every time.

If there isn't some kind of SIMD-nerds forum/mailing list for talking about this sort of stuff, there really should be...

Yeah I don't know of a good one.

zwegner commented 4 years ago

Definitely, but that's why I say "upper bound" - i.e., in the worst case we just assume that every i-cache line misses, so if your code takes 640 bytes (10 lines) then that's 10 misses. Of course, that's not truly an upper bound, since worse things could happen (e.g., itlb misses and page walks) - but it's a pretty realistic upper bound for most cases I think.

Wouldn't it be 20 misses, for the other code that was in the I$? And the measurement of an I$ miss is rather nontrivial itself... I'll probably hold off at least until I get some more realistic benchmarking, or maybe until somebody complains. :)

Moving to C++ is probably not a bad idea here: with a template interface, it'd be easy to make a big code size footprint opt-in: only instantiate the small version if it's not called often or only with small-ish inputs.

You are not alone. Have you looked at Gem5 or snipersim? They are somewhere between not-terrible and not-great. They are probably not useful for fine-grained tweaking like moving some instructions around or replacing a shuffle with two blends or whatever, but for larger changes they can be helpful.

I hadn't seen those, thanks for pointing them out. Weird that snipersim requires you to fill out a form to download it or see the source... But it uses Pin anyways, so that's probably out for me (dang macOS).

I use performance counters a lot and with small controlled tests they can be very reliable, e.g., down to 0.1% for most, or better if you take even more precautions. There are some undocumented counters that I have been finding useful too; I'll write about them maybe. This has the advantage of exactly matching the hardware (after all, it is the hardware), but noise and limitations like only having up to N counters at once are annoying.

Right, I haven't used PMU counters in a while. Looking around, macOS has a profiler that can access them, but I'd need to install the multi-GB Xcode behemoth. If I get too much deeper in performance work, the benefits of Linux would probably overcome the pain of using two OSes...

Indeed, the simdjson guys often talk about changes where some change was faster/slower in isolation (e.g., utf8 parsing only), but when interleaved, the timing reversed. So for them, they really have to test in-situ every time.

That's been a common theme throughout most of my software career whenever measurement is involved. You can usually only get accurate/unbiased answers by directly observing the exact thing you care about, which if you're lucky is one concrete thing. For a library that has unknown uses, you can really only guess what sort of inputs you might get.

travisdowns commented 4 years ago

Wouldn't it be 20 misses, for the other code that was in the I$?

Yes, you are right, it would be up to 20 - one coming in (which could be caught by benchmarking in-situ the decoding itself), and then one later when that line is evicted (which would be spread out across the application thus harder to link back to the algo).

And the measurement of an I$ miss is rather nontrivial itself... I'll probably hold off at least until I get some more realistic benchmarking, or maybe until somebody complains. :)

Yes, although to be clear, I wasn't even suggesting any measurement. Simply look at the code size and divide by 64 basically to determine the max reasonable number of possible icache misses. Just to move the discussion from the totally non-quantitative space of "it's faster, but the code grew, so is it really better?" to "well the code grew by 500 bytes, so that's 8 icache lines, hence at most 16 misses, so if we can save 1,600 nanos over a single invocation, we can be pretty darn sure it's a win. If we can save say 200 nanos, then we are deep in a grey area".

It turns out that even with very rough estimates, and gross overestimation of the number of misses [1], the tradeoff point is quickly met, so extra code isn't all that bad when the inputs are good.

I use the same kind of rough analysis to decide if/when a large(r) LUT will be useful. LUTs can be much bigger than reasonable functions (e.g., 100s of KiBs).


[1] In particular, for a contiguous function, counting the number of lines to estimate cache misses is likely to be a big overestimate, because we'd expect the fetch engine to prefetch/overlap nearly all of these misses since they are contiguous, so the cost is likely to be closer to 1 or 2 misses total. The "trailing" misses in application code that had their lines evicted are harder to model, but the worst case is somewhat more likely because there is no particular reason to assume those misses will be contiguous.

travisdowns commented 4 years ago

Moving to C++ is probably not a bad idea here: with a template interface, it'd be easy to make a big code size footprint opt-in: only instantiate the small version if it's not called often or only with small-ish inputs.

I have mostly given up on C for high-performance code. Even when I think "OK, I can do this in pure C" and it works out, I inevitably run into friction where I wish I had C++, or when I later go to extend it, I wish it was C++.

I used C because it has wider support so more people could use it, but since I've moved mostly to C++ what I'll do is sometimes offer a C header as FFI into the C++ implementation (of course, this precludes "header only").

Right, I haven't used PMU counters in a while. Looking around, macOS has a profiler that can access them, but I'd need to install the multi-GB Xcode behemoth. If I get too much deeper in performance work, the benefits of Linux would probably overcome the pain of using two OSes...

I think it's very hard to beat or approach Linux for most high-perf coding. Things like perf are just too useful and powerful, and OSX (and Windows) lag behind there. Although XCode does offer some perf counter access, I think it's fairly limited and lagging.

The good news is you can do your benchmarking on Linux, and it will mostly carry directly over to OSX and Windows as long as the generated code is similar (especially if you use e.g., clang on all three platforms), and there is not a large OS-dependent component (e.g., kernel calls).

zwegner commented 4 years ago

Yes, although to be clear, I wasn't even suggesting any measurement. Simply look at the code size and divide by 64 basically to determine the max reasonable number of possible icache misses. Just to move the discussion from the totally non-quantitative space of "it's faster, but the code grew, so is it really better?" to "well the code grew by 500 bytes, so that's 8 icache lines, hence at most 16 misses, so if we can save 1,600 nanos over a single invocation, we can be pretty darn sure it's a win. If we can save say 200 nanos, then we are deep in a grey area".

Oh, I shouldn't have said measurement there--maybe "calculating the cost" instead. What I was thinking was that, besides the L2$ latency of pulling code into the I$ (assuming L2 fits the code, which is uncertain), there are other negative effects. My understanding was that modern Intel chips have reduced performance on the first run of code that's just been brought into I$, since there's metadata about instruction start points to help in decoding. I can't remember where I read this, though, and perusing wikichip doesn't show anything that quite matches--maybe this predates the uOP cache. I'm also not sure the fetch engine will be prefetching contiguous lines into I$, if there's no branch prediction information from the current instruction stream--seems like that would be wasteful in the case of taking a rare, small, side branch for a special case. I really don't know though, and it doesn't look like there's much public information about it (at least on wikichip).

I used C because it has wider support so more people could use it, but since I've moved mostly to C++ what I'll do is sometimes offer a C header as FFI into the C++ implementation (of course, this precludes "header only").

Do you know of anyone that hasn't used your code due to using C++? Even without the adaptive algorithm changes, using C++ would already be nice to get rid of all the stupid macros for various ISAs... I'll probably just put a note in the README that C users can use the library from an old commit if they really need it.

I will definitely add a C API, though, since I need that to call into the library from LuaJIT in my test script.

travisdowns commented 4 years ago

Oh, I shouldn't have said measurement there--maybe "calculating the cost" instead. What I was thinking was that, besides the L2$ latency of pulling code into the I$ (assuming L2 fits the code, which is uncertain), there are other negative effects.

Yes, sure - but I'm ballparking 100 nanoseconds per miss. That's way more than the L2 latency of about 3 nanos. That is, I'm assuming that every icache miss is a full miss all the way to DRAM. So I would call it a very conservative upper bound.

My understanding was that modern Intel chips have reduced performance on the first run of code that's just been brought into I$, since there's metadata about instruction start points to help in decoding. I can't remember where I read this, though, and perusing wikichip doesn't show anything that quite matches--maybe this predates the uOP cache.

Modern "big" Intel chips (e.g., server and laptop chips) don't save boundaries like that and haven't done it for a long time, but smaller chips like Goldmont apparently do. I think recent AMD chips may also do. That said, there is a similar effect: the uop cache needs to populated and this has a cost the first few times fresh code is run.

All that said, these costs are small "per cache line", you are talking a cycle here or a cycle there of stalling, or maybe a reduced IPC from 4 to 3 or something. So like single digit nanos. Since I'm giving a budget of 100 nanos, this is almost lost in the noise.

If you wanted to model this much more precisely, you'd have to care about all this, of course.

I'm also not sure the fetch engine will be prefetching contiguous lines into I$, if there's no branch prediction information from the current instruction stream--seems like that would be wasteful in the case of taking a rare, small, side branch for a special case.

Yes, there will be no BP info, so the default is to predict sequential execution, i.e., "no branches" (plus occasionally you might get a spurious hit when you get a collision in the branch predictor). You can safely assume that contiguous code sequences will be efficiently fetched: this after all is the common and important case when you jump to some cold code. Yes, you'll sometimes bring in a line that you don't need - but that cost is small compared to taking misses one at a time in straight-line code (note that the code doesn't actually have to be straight-line for this to work: sequential fetching is good even for code with branches, because most branches are short or only taken some of the time, and overall functions tend to be contiguous even if they have embedded branches).

The analogy is more or less direct with data prefetching: it's really, really worth it almost all the time, even though data prefetching is probably even less reliable than instruction prefetching and you make a lot of mistakes. It's all about that MLP!

You can test this too: you can see that once you've executed one cache line of code, the next line is usually hot even if you never executed it. I have some tests which show that indirectly buried in uarch-bench somewhere.

Note that "instruction" vs "data" prefetching is only really separate at the L1 level (and L0, if you say consider uop cache L0). The L2 is combined instruction and data, so the L2 prefetcher works, IIRC, basically the same with data and instruction prefetches, at this point it's just looking at the stream of accesses and there is no branch prediction or anything involved. I wrote a bit about the L2 counters you can use to see this here.

I really don't know though, and it doesn't look like there's much public information about it (at least on wikichip).

Wikichip is a great resource for per-chip details and some light uarch details, but for in-depth stuff you have to go elsewhere. I don't know of any great single resource though. There are Agner's guides and blog, this blog, scattered answers on StackOverflow, the Intel and AMD optimization guides, etc. You might also enjoy this article where I cover some of the common perf limiters on modern Intel and AMD chips (and really any OoO chip), although it doesn't get into front-end/icache limits much at all (there's a section on FE effects, but I admit right up front that I'm glossing it over) </end self promotion>.

Do you know of anyone that hasn't used your code due to using C++?

I don't think people have used that much of my open source code at all :), but I haven't gotten such a note. I think most people just wouldn't tell you though. Apart from my stuff, I've definitely seen people promote and be more likely to use stuff that is "pure C" though, as it reduces some friction to use. That's why I used to try, but I've mostly regretted it and decided to go to C++, and maybe other langs (e.g., Rust) in the future - but I feel like you just run into the limitations of C fairly quickly. C++ gets a bad rap because of all the crazy metaprogramming and other inscrutable things that people like to do: but of course you are free to simply not do that!

geofflangdale commented 4 years ago

Interesting discussion.

A note on C vs C++ - we picked C++ for Hyperscan's compiler but had a C API for it, and used pure C for the Hyperscan run-time. At that time the logic was that we wanted to run in kernel, but that the compiler was too big and ugly to do in C. The chief problem people have with C++ is bringing the C++ run-time along; historically there was some possibility of being able to program in some weird subdialect of C++ that avoided exceptions and RTTI and so on, and by using a bunch of crazy compiler flags you could in theory avoid roping in the C++ run-time. We put it in the too-hard basket. This technology may have matured since.

For simdjson, which is (to a surface level) more popular than Hyperscan, I just decided to use C++ and no-one blinked.

On instruction-stream effects: my already-a-bit-too-casual performance analysis modelling has never really gotten very sophisticated on the i-stream side, but we did find real-world i-stream effects where Hyperscan slowed down Suricata in practice due to i-stream stuff, despite being faster in isolation. IIRC this was mostly ITLB, not i-cache.

If you have some good benchmarks for this "interference from the rest of the workload", I've been thinking about trying to incorporate some of this in some upcoming performance analysis work. I like the theory of trying to use https://en.wikipedia.org/wiki/Robust_parameter_design for this but it remains to be seen whether all that statistical machinery is justifiable.

travisdowns commented 4 years ago

If you have some good benchmarks for this "interference from the rest of the workload"

I'm not sure if you were asking me or Zach, but personally I don't.

Basically because I think it's really hard to come up with something accurate and general, and even if you could you'd need to somehow instrument the rest of the workload to plug into your super-accurate model, e.g., figure out the distribution of instruction cache line accesses in the workload so you can model the impact of your algorithm on that ... but at that point why not just actually benchmark your algorithm in situ, which is probably both easier and more accurate. I.e., you are not going to come up with a "standalone model" which would let you answer absolutely the types of questions we were discussing here: it will always depend on the surrounding code.

So I have this very hand-wavy "upper bound" type model that I describe above, in order to make decisions while developing, i.e., to have very gross estimates of the cost of extra code, and to have a large region where I know "it doesn't matter", but then for the remainder of the giant grey area, I'll make a note that there's a code-size/perf tradeoff and try to benchmark it in situ if possible. Of course, that's not always possible, so I just try to make a reasonable tradeoff. Often it's not too hard because there will be some kind of diminishing returns, like the 5,000 byte function might be a fair amount faster than the 1,000 byte one, but the 50,000 byte one is only slightly faster, so just stick with 5,000 bytes.

So I just try to take the obvious tradeoffs, e.g., combine very similar code into a common function or do some work to avoid template related explosion even though it costs another call. Typical case: a function which is specialized on a type or non-type template parameter, which really helps performance, but in the generated function most of the code does not depend on the parameter: so when I'm trying to do it right, I'll do the work to keep the specialization, but factor out the stuff that doesn't depend on the template and dispatch to it from each specialization. So instead of 10x 1000 byte functions I have 10x 100 byte functions which call a common 1000 byte function. This almost always hurts in benchmarks, but very slightly and should be better in real life. Often I never get the chance to validate the latter part.
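
In C terms (to match the other snippets in this thread; all of the names and the per-variant constant below are made up), the shape of that factoring is roughly:

    #include <stddef.h>
    #include <stdint.h>

    /* the ~1000-byte parameter-independent body, emitted exactly once;
     * noinline keeps the compiler from duplicating it into each caller */
    static __attribute__((noinline)) int
    validate_common(const uint8_t *buf, size_t len, uint32_t variant_mask)
    {
        (void)buf; (void)len; (void)variant_mask;
        /* ... all the shared work ... */
        return 1;
    }

    /* the ~100-byte "specializations": only the differing setup lives here */
    int validate_variant_a(const uint8_t *buf, size_t len)
    {
        return validate_common(buf, len, 0x0F /* hypothetical per-variant constant */);
    }

    int validate_variant_b(const uint8_t *buf, size_t len)
    {
        return validate_common(buf, len, 0xFF);
    }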

travisdowns commented 4 years ago

What I am quite interested in though is a good measure of code size, i.e., footprint. You can't just count the cache lines used statically, because some lines may not be touched at all. I don't think you can really count the number of lines touched dynamically either because there's a big difference between a 1000-line thing which executes 990 lines once and then spends the rest of the time looping on the last 10 lines, versus one that spends the entire time looping over all 1,000 lines.

Those are different along at least three important axes:

This article and this related one are IMO worth a read. Now, I swear he had something specific about code size, but those aren't it; still, they're somewhat relevant.

SDE has a "footprint" mode where it spits out the code footprint of your application.

zwegner commented 4 years ago

The analogy is more or less direct with data prefetching: it's really, really worth it almost all the time, even though data prefetching is probably even less reliable than instruction prefetching and you make a lot of mistakes. It's all about that MLP!

For this (and everything you wrote before it), I think you're completely right, and I don't have much to add.

You can test this too: you can see that once you've executed one cache line of code, the next line is usually hot even if you never executed it. I have some tests which show that indirectly buried in uarch-bench somewhere.

Cool--I intend to dive more into uarch-bench at some point (and its wiki now that you mention it). Another reason macOS sucks...

Note that "instruction" vs "data" prefetching is only really separate at the L1 level (and L0, if you say consider uop cache L0). The L2 is combined instruction and data, so the L2 prefetcher works, IIRC, basically the same with data and instruction prefetches, at this point it's just looking at the stream of accesses and there is no branch prediction or anything involved. I wrote a bit about the L2 counters you can use to see this here.

Right--I was referring to the L1I$ here though. But even there, you're right about sequential prefetching being the sane default in (say) 95% of the cases.

I really don't know though, and it doesn't look like there's much public information about it (at least on wikichip).

Wikichip is a great resource for per-chip details and some light uarch details, but for in-depth stuff you have to go elsewhere. I don't know of any great single resource though. There are Agner's guides and blog, this blog, scattered answers on StackOverflow, the Intel and AMD optimization guides, etc.

Yeah, I'd hope a wiki would be a place where a lot of that information gets aggregated, but that's life. I of course knew about Agner Fog's work, and recently saw the stuffedcow blog--great stuff.

I don't think people have used that much of my open source code at all :), but I haven't gotten such a note. I think most people just wouldn't tell you though. Apart from my stuff, I've definitely seen people promote and be more likely to use stuff that is "pure C" though, as it reduces some friction to use. That's why I used to try, but I've mostly regretted it and decided to go to C++, and maybe other langs (e.g., Rust) in the future - but I feel like you just run into the limitations of C fairly quickly. C++ gets a bad rap because of all the crazy metaprogramming and other inscrutable things that people like to do: but of course you are free to simply not do that!

Yeah, for a library this size, with no STL, no exceptions, no heap allocations, and only a bit of specialization through templates, it's probably not too objectionable. I'll keep a branch with the pure C version just in case, though.

Self promotion warning: you might also enjoy this article where I cover some of the common perf limiters on modern Intel and AMD chips (and really any OoO chip), although it doesn't get into front-end/icache limits much at all (there's a section on FE effects, but I admit right up front that I'm glossing it over).

Wow, that's an awesome resource! I admire your attention to detail. I have pretty bad ADHD, and it's hard to imagine ever starting something like that and finishing. Same with that stuffedcow blog.

travisdowns commented 4 years ago

it's hard to imagine ever starting something like that and finishing

Basically I didn't finish, it just got too long so I just left out or glossed over any remaining stuff that I had planned and just posted it :).

zwegner commented 4 years ago

What I am quite interested in though is a good measure of code size, i.e., footprint. You can't just count the cache lines used statically, because some lines may not be touched at all. I don't think you can really count the number of lines touched dynamically either because there's a big difference between a 1000-line thing which executes 990 lines once and then spends the rest of the time looping on the last 10 lines, versus one that spends the entire time looping over all 1,000 lines.

In theory, this could be accomplished pretty easily (and with perfect accuracy) by compiler instrumentation: just increment a counter at the start of every basic block. You'd have to somehow make sure that the profiling version uses the same CFG (i.e. that the extra size of the counters doesn't trigger/not trigger some basic block coalescing optimization). The basic block counts could then be applied to the actual code layout without the counters to get how often certain cache lines are touched. You could probably also go a bit further and sample traces of basic block sequencing, to try and tease out some more global effects (like which cache lines tend to be mapped to the same set and thus evict each other frequently). I don't know if any of this is possible in existing tooling--probably not super difficult to do in LLVM, but more effort than I'd want to spend right now.
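
Just to make that concrete, here's a hand-rolled toy version of what the compiler would be emitting (block IDs and names made up; in reality the increments would be inserted automatically and mapped back onto the uninstrumented build's layout to estimate per-cache-line hotness):

    #include <stddef.h>
    #include <stdint.h>

    /* one slot per basic block; each block bumps its own counter */
    enum { BB_ENTRY, BB_ASCII, BB_MULTIBYTE, BB_EXIT, BB_NUM };
    static uint64_t bb_hits[BB_NUM];

    int toy_scan(const uint8_t *buf, size_t len)
    {
        bb_hits[BB_ENTRY]++;
        for (size_t i = 0; i < len; i++) {
            if (buf[i] < 0x80) { bb_hits[BB_ASCII]++; continue; }
            bb_hits[BB_MULTIBYTE]++;
            /* ... the interesting per-byte checks would go here ... */
        }
        bb_hits[BB_EXIT]++;
        return 1;
    }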

Or we figure out how to conjure a great simulator out of the ether...

travisdowns commented 4 years ago

I guess I would disagree here?

That seems like a tough way to do it. In particular, compile-time instrumenting your application and then trying to map the values you record from the instrumented binary back to an uninstrumented version seems basically difficult-to-impossible.

E.g., this: "you'd have to somehow make sure that the profiling version uses the same CFG". I don't think current compiler technology can do that: the only feasible approach I could see would be to force the optimized uninstrumented version to have the CFG of the instrumented version, which is pointless because then you aren't gather stats for the optimized version anymore. I don't think the reverse is possible; the reverse is basically telling the compiler to compile a more complicated/richer program with the CFG of a simpler one, which doesn't seem possible in general.

Why not do this at the binary level? As above, SDE already does this, and things like a Pin tool can do this more or less trivially. Cachegrind does it for data, and there are a ton of papers out there which record things like every data access; it would be easy to do the same for instruction accesses.

I didn't follow why this would involve a "great simulator" - the suggested compile-time approach doesn't involve a simulator: it is just recording basic block entries then (trying to) map that to icache line touches. So the binary instrumentation/emulation approach only has to meet that standard, which is trivial.

Yeah, true, either approach only gives you a list of touched lines: then you need to apply a model on top of that to understand the icache impact, but that's an independent problem, no?

zwegner commented 4 years ago

Regarding compiler instrumentation: I really don't know about the difficulty of modifying gcc/LLVM for this purpose, maybe it's actually hard. I think in theory such instrumentation should be fairly straightforward, though. Basically, set aside some global memory for counters, and every basic block starts with an increment of its unique counter. To ensure coherence between non-/instrumented code, this could be inserted at a late enough stage that the CFG is fixed, and both versions could be output from the same compilation. One complication that I just thought of is the case where flags need to be preserved across basic block boundaries: since add mem, imm modifies flags, you'd have to do a weird little shuffle of pushf/add/popf, assuming you can freely use the stack.

If SDE/pin can do this, though, all the more reason to just do that. The simulator thing was just another alternative to compiler instrumentation, without all the annoying crap above.

travisdowns commented 4 years ago

Regarding compiler instrumentation: I really don't know about the difficulty of modifying gcc/LLVM for this purpose, maybe it's actually hard. I think in theory such instrumentation should be fairly straightforward, though. Basically, set aside some global memory for counters, and every basic block starts with an increment of its unique counter.

Yeah, I'm pretty sure this already exists several times over, for slightly different purposes. E.g., for regular profiling, for PGO, and for coverage analysis, e.g., see all the instrumentation options in gcc. This is the "easy part", I thought.

To ensure coherence between non-/instrumented code, this could be inserted at a late enough stage that the CFG is fixed, and both versions could be output from the same compilation.

Right, after some more thought I think it is possible as you suggest. In my mind I was thinking about the "source level" basic blocks, and these end up radically transformed and often disappear after optimization, so any instrumentation that might interfere with that would be a problem (consider, e.g., a scalar loop that can be vectorized: if you put some instrumentation on every arc (bb entry) then you ruin the vectorization and produce totally different code).

... but of course not - you are talking about the assembly level bb, so as you point out you can insert the instrumentation at some late stage where it's guaranteed not to affect the CFG and the dual output from one compilation solves most of my other concerns. I still think it's overkill compared to a binary-level instrumentation since those are widely available for this kind of task already, and don't require that you recompile everything with this option (a big plus for things like libc).

One complication that I just thought of is the case where flags need to be preserved across basic block boundaries: since add mem, imm modifies flags, you'd have to do a weird little shuffle of pushf/add/popf, assuming you can freely use the stack.

Maybe you could just put it in before register allocation so it can get allocated a register. Yes, it might arbitrarily change the generated code, but that's already happening to a lesser extent (i.e., the instrumented code is already larger), so you already need the "reverse mapping" step, so maybe it's not a problem.

zwegner commented 4 years ago

I still think it's overkill compared to a binary-level instrumentation since those are widely available for this kind of task already, and don't require that you recompile everything with this option (a big plus for things like libc).

Definitely. Even if SDE isn't working for me, it'd sure be easier to work around that than writing a compiler extension.

Maybe you could just put it in before register allocation so it can get allocated a register. Yes, it might arbitrarily change the generated code, but that's already happening to a lesser extent (i.e., the instrumented code is already larger), so you already need the "reverse mapping" step, so maybe it's not a problem.

I'm not sure allocating a register gets you anything...? add reg, imm also modifies flags, and each basic block gets an independent counter anyways. Without preserved flags, it should just be one instruction, incrementing a global symbol via a RIP-relative address.

I suspect (based on looking at disassembly for the ASCII-early-exit after incrementing a counter from the first post here) that LLVM doesn't support implicit flags carried across basic blocks anyways, so pushf/popf wouldn't be necessary.

travisdowns commented 4 years ago

I'm not sure allocating a register gets you anything...? add reg, imm also modifies flags, and each basic block gets an independent counter anyways. Without preserved flags, it should just be one instruction, incrementing a global symbol via a RIP-relative address.

Well what I really meant was to inject the "counter increment" in the IR before lowering to machine specific code, etc. So you wouldn't be thinking "inject pushf; inc [mem]; popf" but rather just putting the add [mem], 1 in "IR speak" at the place it belongs. At this point things like x86 registers don't exist, and neither do flags except perhaps in some abstract sense. Then the compiler is responsible for emitting the proper code. Normally that will just result in an inc [mem] without any flag manipulation because, as you point out, flags are not "kept" for very long in practice, and perhaps rarely or never across bb boundaries.

So you'd end up with more efficient code and it should work on any platform. Of course, a small change to the IR can have arbitrarily large "knock on" effects so in some cases the generated code might look quite different, but I think that's not a problem if I understood your suggestion.