Three pprof enhancements

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run pprof on a heap profile that has lines like: "0: 0 [42: 239842]
0x..." and observe a division by zero error.
2. Run pprof with default options and observe that addr2line is called even
though --functions is the default (not --lines).
3. Run pprof on a large binary that links against numerous shared
libraries, and observe extremely long run times.

What version of the product are you using? On what operating system?

I am running pprof from google-perftools svn r91 against profiles generated
by jemalloc (http://www.canonware.com/jemalloc/) on Ubuntu 9.10 and Centos
5.2 systems.  The use of jemalloc is relevant only to (1).

The attached patch provides fixes for (1) and (2), and it dramatically
improves run time for (3) by reducing the typical algorithmic complexity of
the ExtractSymbols subroutine.

Original issue reported on code.google.com by joev...@gmail.com on 12 Apr 2010 at 5:54

Attachments:

pprof.diff

GoogleCodeExporter commented 9 years ago

Thank you for the patches and the clear bug report!  (3) makes sense to me, and 
it 
looks like (1) is a problem for you because you're emitting heap_v2-style 
profiles, 
which is not what pprof tests against (it probably should, and in fact the heap 
profiler should really be rewritten to use v2 sampling).

As for (2), the reason we prefer addr2line is that it can properly handle 
inlining 
(when used with -i), which nm can't do, so even in --functions mode it's worth 
running.  Is there a problem using addr2line, that you want to remove the call 
to it?

Also, can you sign the CLA at 
http://code.google.com/legal/individual-cla-v1.0.html?  
This will allow us to accept your patches as-is.

Original comment by csilv...@gmail.com on 12 Apr 2010 at 6:29

Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

(2) (the addr2line change) was primarily intended as a performance improvement. 
addr2line takes about 15 minutes for some of the binaries I'm working with, so 
this
makes a pretty big difference in usability for me.  Is there perhaps another 
approach
to shutting off addr2line that you would find more acceptable?

I signed the CLA.

Original comment by joev...@gmail.com on 12 Apr 2010 at 11:15

GoogleCodeExporter commented 9 years ago

} addr2line takes about 15 minutes for some of the binaries I'm
} working with, so this makes a pretty big difference in usability for
} me.

Wow, that's a surprisingly long time.  I take it you don't need the
functionality that correctly attributes memory use to inlined
functions?

There are several options I can think of off the top of my head.  One
is to add a flag saying not to use addr2line, in the same way this
line does it already for OS X:
    $obj_tool_map{"addr2line"} = "false";  # no addr2line

Or perhaps we could add a flag that lets folks specify values for all
of obj_tool_map on the commandline.

You can also get the same result right now, without any code changes
at all, though it requires a bit of setup on your end.  You have to
set up a directory with symlinks to objdump, nm, and (perhaps) c++filt
in it, but *not* addr2line.  You then pass in this directory to the
--tools flag of pprof (or optional set the env var PPROF_TOOLS).

Would this work for you?  That's my favorite solution, I think, if
it's practicable.

Original comment by csilv...@gmail.com on 13 Apr 2010 at 12:10

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Any more word on this?  How do these proposed solutions work for you?

Original comment by csilv...@gmail.com on 7 Jun 2010 at 11:24

GoogleCodeExporter commented 9 years ago

The --tools and PPROF_TOOLS options aren't ideal, because they require setting 
things up in advance.  Adding a way to specify $obj_tool_map on the command 
line would be slightly better, but all-or-nothing tools configuration isn't 
very convenient for my use case.  I'd prefer to add support for something like 
--tool=addr2line:false so that individual tools can be specified.  This seems 
sufficiently general to meet other use cases as well.  What do you think?

Original comment by joev...@gmail.com on 8 Jun 2010 at 12:11

GoogleCodeExporter commented 9 years ago

Thanks for the feedback.  I think there's a feeling, generally, that the pprof 
commandline is already too complicated, so adding a new flag like that may not 
go over so well.  But I'll suggest it to other pprof developers and see what 
they think.

Original comment by csilv...@gmail.com on 8 Jun 2010 at 6:05

GoogleCodeExporter commented 9 years ago

Original comment by csilv...@gmail.com on 8 Jun 2010 at 6:49

Changed state: Started

GoogleCodeExporter commented 9 years ago

This should be fixed in perftools 1.6, just released (by augmenting the syntax 
of the --tools flag).

Original comment by csilv...@gmail.com on 5 Aug 2010 at 8:50

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks for augmenting the --tools syntax!

Unfortunately, it looks like the divide-by-0 error and the performance 
improvement for ExtractSymbols in the patch attached to this issue were not 
integrated.

Original comment by joev...@gmail.com on 6 Aug 2010 at 12:26

GoogleCodeExporter commented 9 years ago

You are right, I totally forgot about those. :-(  I am leaving on vacation 
shortly, but will add those in when I return, and push them to svn.  They'll 
also be part of the next release.

Original comment by csilv...@gmail.com on 7 Aug 2010 at 2:41

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

OK, I've applied the divide-by-0 fix -- though I don't understand why it's 
necessary to protect against n1 being 0 but not n2.

But I'm confused by the performance improvement.  As I understand it, you're 
just sorting the libraries and the symbols, and going through them that way.  
But I don't understand the point of @pcs2.  Once you see $pc > $finish, you 
know it's not going to be in any library (since your current library is one 
with the greatest finish).  So why keep these values in @pcs2, and add them 
back onto @pcs?  When would this a pc in @pcs2 ever actually be used?

Also, why sort the libraries in reverse order?

If I were implementing this, I'd do something like this:
   @pcs = sort ...;
   @libs = sort ...;
   for $lib (@libs) {
     $start = ...;
     $finish = ...;
     $start_pc_index = <first index into @pcs >= $start>;
     $finish_pc_index = <first index into @pcs < $finish>;
     @contained = splice(@pcs, $start_pc_index, $finish_pc_index - $start_pc_index);
     Map...();
   }

Does this seem like it would work, to you?

Original comment by csilv...@gmail.com on 31 Aug 2010 at 1:06

GoogleCodeExporter commented 9 years ago

n1 is the current object count, which can be 0, but n2 is a cumulative count, 
which is strictly greater than 0.

The old version of the pc-->lib mapping code was effectively an O(L*P) 
algorithm (for each lib, iterate over pc's and find all that are contained by 
lib), whereas the new version is O(L+P) (plus the cost of sorting, O(L*lg L) + 
O(P*lg P)).  The new version uses sorted pc's so that only a subset of pc's has 
to be examined for each lib.  My notes indicate that for the binaries I was 
testing with, ExtractSymbols accounted for ~24% of execution time before this 
change, and something less than 2.5% afterwards (it dropped out of the top 15 
functions in the CPU profile).

My memory is fuzzy now on the reason for reverse sorting (and for @pcs2), but 
it had something to do with the address ranges for libs overlapping with that 
reported for the main binary.  I tested the changes on 32-bit Ubuntu and 64-bit 
CentOS, and for one of those platforms pc's were preferentially associated with 
the binary rather than the libraries they came from (and symbol lookups would 
then fail) unless libraries were processed from high to low addresses.  Without 
@pcs2, it was possible for pc's to be dropped without mapping them to symbols, 
even though symbols were available from the main binary.

Original comment by joev...@gmail.com on 1 Sep 2010 at 5:56

GoogleCodeExporter commented 9 years ago

OK, that makes some sense (overlapping libraries).  Does the splice 
implementation seem reasonable to you?

I don't think your algorithm is really O(L+P), because you could in theory do 
|P| work for each library addings pcs back onto pcs2.  But I think in practice 
pcs2 will usually be empty, so I can believe it would be fast in practice.

I've written another version of the patch which I think does the same thing but 
I found a bit easier to follow.  Can you take a look and see if you agree?  
Also, it would be great if you could test it on the same test suite you used 
for your own patch, and verify it doesn't cause any problems (I tested it a 
bit, but it looks like your test suite is more thorough).

Original comment by csilv...@gmail.com on 1 Sep 2010 at 11:49

Changed state: Started

Attachments:

pprof.patch

GoogleCodeExporter commented 9 years ago

Another thought: what if you sorted the libraries by $finish (a[2]) rather than 
$start (a[1])?  Would that solve the problem you were seeing with overlapping 
binaries?  Then we really could just do a 'merge' sort, with one last cleanup 
step at the end for the final binary.

Original comment by csilv...@gmail.com on 1 Sep 2010 at 11:52

GoogleCodeExporter commented 9 years ago

The splice implementation seems reasonable, but it potentially passes the same 
pc into multiple MapToSymbols calls.  That may actually be the right thing to 
do (and if so, the sorting order can be changed as you suggest), but I don't 
know enough about the results of the nm and addr2line processing to know what 
the result is.  At a glance, it looks to me like it is possible for $symbols to 
contain multiple resolutions for a single pc, but a comment ("Prepend to 
accumulated symbols for pcstr") implies that getting the order wrong will 
result in incorrect results.

If it is *not* okay to pass the same pc to MapToSymobls multiple times, then I 
don't think it's safe to change the way sorting is done.  I vaguely remember 
trying the sorting strategy you suggested, and some pc's were not resolved to 
symbols.

Original comment by joev...@gmail.com on 2 Sep 2010 at 12:08

GoogleCodeExporter commented 9 years ago

I'm missing something I think.  How can splice pass the same pc into multiple 
MapToSymbol calls?  splice() removes the items from the input list that it 
returns; were you taking that into account?

Original comment by csilv...@gmail.com on 2 Sep 2010 at 12:29

GoogleCodeExporter commented 9 years ago

Nope, I had incorrect splice semantics in mind.

In your patch, you determine $start_pc_index by iterating forward from the 
beginning of @pcs.  This will tend to be less efficient than determining 
$finish_pc_index first, then iterating backward from there to determine 
$start_pc_index.

Original comment by joev...@gmail.com on 2 Sep 2010 at 12:58

GoogleCodeExporter commented 9 years ago

True.  OK, rewritten and attached.  Can you test this on your various OS 
environments and make sure it's behaving as expected?

Original comment by csilv...@gmail.com on 2 Sep 2010 at 8:54

Attachments:

pprof.patch

GoogleCodeExporter commented 9 years ago

I had to make a fix to your patch, but with that change in place, I got correct 
results for my tests, with similar performance to the version of ExtractSymbols 
that I submitted.

Original comment by joev...@gmail.com on 8 Sep 2010 at 10:28

Attachments:

pprof.diff

GoogleCodeExporter commented 9 years ago

Ok, I think this is all in perftools 1.7, just released.

Original comment by csilv...@gmail.com on 5 Feb 2011 at 12:21

Changed state: Fixed

yodamaster / gperftools

Three pprof enhancements #234