xrmx / bootchart

merge of bootchart-collector and pybootchartgui
GNU General Public License v2.0
233 stars 88 forks source link

main boot-collector process exits because of race condition with previous boot-collector process with --usleep arg detected #25

Open hyei opened 13 years ago

hyei commented 13 years ago

I've observed that bootchartd is unable to dump logs about 10% of the time because the bootchart-collector process isn't available. It looks like the process that is supposed to be there exits because there is a bootchart-collector process with the argument --usleep 5000 still running. This is from the wait_boot() function thats put in the background right before.

I've added -c to /sbin/bootchartd and a fprintf at the beginning of main() in collector.c to narrow down the issue. [partial output] Starting bootchart logging bootchart: no initrd used; starting bootchart-collector started as pid 415 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 414 with 2 args: '-c' '50' bootchart console.debug [1] bootchart-collector started as pid 413 with 3 args: '-c' '--usleep' '250000' bootchart-collector started as pid 414 with 2 args: '-c' '50' bootchart-collector run outside initrd bootchart collector already running as pid 416, exiting... bootchart-collector started as pid 416 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 417 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 418 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 419 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 420 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 421 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 422 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 423 with 3 args: '-c' '--usleep' '5000' bootchart-collector started as pid 424 with 3 args: '-c' '--usleep' '5000'

.....

bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ' bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ' Failed to find the collector's pid bootchart-collector pid: 1207 unmounted proc / clean exit Can't extract boot chart from collector

mmeeks commented 13 years ago

Riiight - it is possible that we are mis-filtering collectors that have the --usleep still running - but there should be code to detect and simply skip over those. Quite possibly that code doesn't work when they are exiting [ perhaps there is some race condition whereby the cmdline-args are destroyed by glibc before the kernel kills the process or something ? ]. If so that would be an interesting angle [ and yes the intermittent nature is deadly annoying - help fixing it much appreciated ]. Could you enable the debug to collector/dump.c in bootchart_find_running_pid to see if we can get something out of that ?

In the medium term, I had hoped to move towards killing the nasty shell-script wrapper, and moving ever more of that into the collector process itself (into plain C) - any progress you can help out with there is much appreciated !

hyei commented 13 years ago

The call to fread() on the cmdline file in the proc entry of the process with the --usleep argument returns a length of 0 instead of the expected len of "--usleep". I think it is somehow catching the process so early in its execution that the proc entry hasn't been completely initialized yet.

len = fread (abuffer, 1, 4095, argf);

Annoyingly enough, the more debug code I throw in, the harder it is to reproduce.

I could kill the loop at the beginning of wait_boot, because I've modified the script to wait X seconds as the exit criteria which will fix it for me, but will give some thought on the race condition.

mmeeks commented 13 years ago

Wow - that is helpful; so yes there are races and unfinished state at start/finish of processes which is sad/broken but hard to avoid I guess.

I think that (really) the sol'n is to bin the use of bootchart-collector --usleep in the code - perhaps we can do this with an installed symlink - so the process has a different name, and then do the usleep mode on argv[0] ? so ln- -s bootchart-collector busleep; busleep 1000 etc. ?

hyei commented 13 years ago

The process will still have the same name, at least as seen under proc. When busleep is executed, /proc/1269/exe is still a link to exe -> /lib/bootchart/bootchart-collector.

I think there will have be a change in the bootchart_find_running_pid function to use some other way to detect whether there is already a bootchart-collector process collecting data.

mmeeks commented 13 years ago

Riight - in reality I think it should be fairly trivial to remove the need for usleep. Currently we use it in bootchartd.in for: "wait_boot()" - we could easily have a collector --wait-boot type call here that could be implemented in native code [ ideally sharing the same task watching stuff ]. Then of course there is the "let the collector get started before letting the system run on" wait - which is itself flawed; IMHO we should spawn the collector synchronously, and it should call daemon(); when it is ready to let the system continue booting.

Patches much appreciated [ of course ]. I guess it is the 'wait_boot' ones that are most problematic.

xrmx commented 13 years ago

What about this? This is only compile tested and break initrd case. http://people.freedesktop.org/~rm/bootchart2/move-wait-boot-to-collector.diff

mmeeks commented 13 years ago

Patch looks interesting to me, I think we need to ',' delimit arguments of EXIT_PROC (and in strtok) otherwise they'll be separate arguments ;-)

I don't quite see how the two steps - running the collector inside some odd chroot, and running the dump phase from some later date / in a different root works out - but if you say it does :-) that's all good.

xrmx commented 13 years ago

Yeah, this is more of a proof of concept and some more work is needed to have it work.