zyedidia / perforator

Record "perf" performance metrics for individual functions/regions of an ELF binary.
MIT License

[Feature request] Perforated --topdown #9

Open gabriel-rodriguez opened 1 year ago

gabriel-rodriguez commented 1 year ago

First of all, thank you for developing this. I mostly use PAPI by hand to monitor PMCs, and the main reason I don't use perf is that there's no good way to delimit a region of interest (at least, none that I know of).

What I was trying to do now was to apply a --topdown analysis to a region of code. I have looked around in the documentation and I think this is not currently supported by the tool. Would it be simple to add this functionality to the current tool?

zyedidia commented 1 year ago

I'm not familiar with top-down analysis, could you give a bit more detail about that? It's unlikely I will implement it, but I would be happy to merge a pull request.

gabriel-rodriguez commented 1 year ago

Top-down Microarchitectural Analysis (TMA) is a performance analysis methodology developed by Ahmad Yasin at Intel. The idea is quite simple: you measure a very particular set of counters to determine which of four top-level categories dominates your application: front-end bound (instructions cannot be delivered to the back-end fast enough), back-end bound (the classical case, either compute-bound or memory-bound), bad speculation (your code keeps mispredicting branches, so much of the work you do is thrown away), or retiring (that's where we want to be: the pipeline is doing useful work, and ideally the code is already making full use of the CPU's capabilities).

For a more detailed description, I think the best reference is the original paper from ISPASS 2014.

The implementation itself is as simple as measuring a predefined set of PMCs and computing the predefined metrics. Both Intel VTune and Perf have incorporated TMA into their pipelines in the last few years. In VTune you choose "Microarchitecture exploration". In Perf you run perf stat --topdown.
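
To make "computing the predefined metrics" concrete, here is a minimal Python sketch of the level-1 breakdown. It assumes the formulas and event names given in the ISPASS 2014 paper for 4-wide Intel cores; other microarchitectures need different events and a different pipeline width. The names `Counts` and `topdown_level1` are hypothetical and exist only for this illustration, and the sample values are made up.

```python
from dataclasses import dataclass

# Assumption: 4 issue slots per cycle, as for the cores covered in the paper.
PIPELINE_WIDTH = 4


@dataclass
class Counts:
    """Raw counter values collected over the region of interest."""
    cycles: int              # CPU_CLK_UNHALTED.THREAD
    uops_not_delivered: int  # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued: int         # UOPS_ISSUED.ANY
    uops_retired_slots: int  # UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles: int     # INT_MISC.RECOVERY_CYCLES


def topdown_level1(c: Counts) -> dict:
    """Return the four level-1 categories as fractions of all pipeline slots."""
    slots = PIPELINE_WIDTH * c.cycles
    if slots <= 0:
        raise ValueError("cycle count must be positive")

    frontend_bound = c.uops_not_delivered / slots
    # Slots wasted on uops that were issued but never retired, plus the
    # slots lost while the pipeline recovers from a misprediction.
    bad_speculation = (
        c.uops_issued
        - c.uops_retired_slots
        + PIPELINE_WIDTH * c.recovery_cycles
    ) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left over is attributed to the back-end.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)

    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


# Made-up counter values, for illustration only.
sample = Counts(
    cycles=1000,
    uops_not_delivered=800,
    uops_issued=2200,
    uops_retired_slots=2000,
    recovery_cycles=50,
)
metrics = topdown_level1(sample)
```

The four fractions sum to 1 by construction, since back-end bound is defined as the remainder.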

zyedidia commented 1 year ago

Ok, seems simple enough. The PMCs are presumably ones that you can already measure with perforator, so you could manually compute the metrics with those results already? Is the main request here to add support for automatically computing the metrics in perforator?

gabriel-rodriguez commented 1 year ago

The problem is that the exact counters to measure vary on a per-architecture basis. I am not sure whether perforator calls perf under the hood; if it does, the ideal implementation would be to just run perf with --topdown. Otherwise, it's a bit trickier.

zyedidia commented 1 year ago

Do you mean on a per-microarchitecture basis? Because Perforator only supports x86-64 anyway. Perforator does not call perf; it directly uses the PMU API exposed by Linux, so unfortunately it's not as easy as just calling perf stat --topdown.

gabriel-rodriguez commented 1 year ago

Yes, sorry, I did mean microarchitecture. There's even a dedicated TMA spreadsheet that details how to compute the relevant metrics per uarch, but I'm afraid it looks like a painful feature to maintain. If perforator uses the PMU directly, it will not be as easy as I had hoped.