xrmx / bootchart

merge of bootchart-collector and pybootchartgui
GNU General Public License v2.0
233 stars 88 forks source link

Does not produce a correct bootchart when kernel is built with CONFIG_TASKSTATS=n #53

Closed Max-E closed 10 years ago

Max-E commented 10 years ago

The README claims that bootchart2 will work with CONFIG_TASKSTATS=n, just not as well. However, when I try to generate charts on systems with CONFIG_TASKSTATS=n, I get a variety of wrong behaviors, depending on the version of bootchart2.

This is relevant when trying to instrument embedded Linux images generated by Yocto. See Yocto bug 5893: https://bugzilla.yoctoproject.org/show_bug.cgi?id=5893 Yocto can be configured to enable CONFIG_TASKSTATS, but it's off by default, so ideally we'd like bootchart2 to at least work partially without CONFIG_TASKSTATS.

The bug seems to go back a very long way. On versions starting with 0.14.2 all the way up through current master, the bootchart is completely blank. Example: (very tall image, just scroll past it to read the rest) bootchart_master_notask

On versions from 0.12.3 up through 0.14.0, the bootchart shows only the first five seconds, which is not correct. Example: (very tall image, just scroll past it to read the rest) bootchart_0 12 3_notask

In versions prior to 0.12.3, a very small (~20 KB) bootchart.tgz is created, and pybootchartgui complains that it does not contain a valid bootchart.

Meanwhile, in some versions, there are other bugs that seem to have come and gone but which stop me from getting proper test results.

For reference, a correct boot chart for the Yocto image I'm testing looks like this: bootchart_0 14 0_withtask

I am able to get correct boot charts for all versions tested if CONFIG_TASKSTATS is enabled.

Complete results:

bootchart2 version CONFIG_TASKSTATS=y CONFIG_TASKSTATS=n
0.0.4 Not tried No bootchart.tgz created
0.1.0 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.1.0_notask.tgz' does not contain a valid bootchart
0.11.0 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.11.0_notask.tgz' does not contain a valid bootchart
0.11.4 Correct bootchart Parse error: empty state: '/home/maxtothemax/bootchart_0.11.4_notask.tgz' does not contain a valid bootchart
0.12.0 Not tried Won't compile due to missing common.h
0.12.1 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.12.1_notask.tgz' does not contain a valid bootchart
0.12.2 Correct bootchart Parse error: empty state: '/home/maxtothemax/bootchart_0.12.2_notask.tgz' does not contain a valid bootchart
0.12.3 Not tried Only first five seconds shown
0.12.6 Correct bootchart Only first five seconds shown
0.14.0 Not tried Only first five seconds shown
0.14.1 Not tried Blank bootchart
0.14.2 Not tried Blank bootchart
0.14.3 Not tried "Can't extract boot chart from collector"
0.14.4 Correct bootchart Blank bootchart
0.14.5 Correct bootchart Blank bootchart
Current master Correct bootchart Blank bootchart

So as you can see, I could not find a single version of bootchart2 which actually produced a correct bootchart with CONFIG_TASKSTATS=n, out of any of the ones I tested.

Fixing this bug could be as simple as rewriting the README to indicate that CONFIG_TASKSTATS is mandatory, if you do not intend to support kernels without it.

I can get you bootchart.tgz and bootchart.png files for all of these tests if you want them.

xrmx commented 10 years ago

One thing that would be useful to check is if the latest pybootchartgui is able to parse older than 0.14.0 tarballs to see if we have the first five seconds issue there too or so we regressed in the meantime. Could you please upload all these notask tarball somewhere please?

Max-E commented 10 years ago

Sure, here's everything I have: https://drive.google.com/file/d/0B_VvBSJusoCTcV9DZ3JuQ1VFZkE/edit?usp=sharing

EDIT: amended the first post, I'd accidentally used the wrong images.

xrmx commented 10 years ago

So here's what i found:

xrmx commented 10 years ago

So if you build bootchart-collector with this patch applied, what does calling it from a shell print with a kernel without taskstat?

diff --git a/collector/collector.c b/collector/collector.c
index 9c4a16a..855f6e8 100644
--- a/collector/collector.c
+++ b/collector/collector.c
@@ -464,6 +464,7 @@ static int get_family_id(int sd)
        if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
                id = *(__u16 *) NLA_DATA(na);
        }
+       printf("family id %d\n", id);
        return id;
 }

@@ -829,6 +830,9 @@ int main (int argc, char *argv[])
                return 0;
        }

+       printf("use taskstat: %d\n", init_taskstat());
+       return;
+
        if (enter_environment (args.console_debug))
                return 1;
Max-E commented 10 years ago

The result is one line: use taskstat: 1

So init_taskstat is returning 1, and get_family_id is returning prior to the last line.

The reason for this problem is that the global variable netlink_taskstats_id is being declared as __u16, an unsigned type. So even if get_family_id returns a negative number to indicate an error, the check for netlink_taskstats_id > 0 on line 488 won't catch it.

I have written a patch for this bug and will submit it shortly. It seems to resolve all remaining issues.

I do get valid bootcharts without TASKSTATS now: bootchart_patched In fact, this chart seems more complete, because it includes the first five seconds or so. However, I am seeing a lot of extra pidof and bootchart-collector processes in this chart, compared to the ones with CONFIG_TASKSTATS=y. Is that by design?

xrmx commented 10 years ago

In fact, this chart seems more complete, because it includes the first five seconds or so. However, I am seeing a lot of extra pidof and bootchart-collector processes in this chart, compared to the ones with CONFIG_TASKSTATS=y. Is that by design?

Nope, it's a bug :) That process should not appear in the graph, see elide_bootchart in pybootchartgui/draw.py

Max-E commented 10 years ago

I don't think I'm really able to troubleshoot that at this time, due to time constraints. But you should be able to force get_family_id in collector.c to return 0 to simulate no TASKSTATS, so you can reproduce the problem yourself and troubleshoot it.

xrmx commented 10 years ago

Sure i have the data i can reproduce and hopefully fix, i'd just like to make clear it's a bug :)

xrmx commented 10 years ago

Moving specific issues with collector processes not merged to #56