Instrumentation address clash errors and segfaults

thedavidmeister commented 5 years ago

as seen in failing builds https://circleci.com/gh/holochain/holochain-rust/1514

using 0.6.11 and nightly-2018-12-26

i've tried various flags passed to tarpaulin, but builds always seem to fail with this error

also seeing Error a segfault occured when executing test

potentially related https://github.com/xd009642/tarpaulin/issues/35 as we are using threads

appaquet commented 5 years ago

I agree... Personally, I have switched to grcov for the time being... It's a bit trickier to get working, but it's stable once setup (see my script here and its usage in GitHub actions here).

I'll happily switch back to Tarpaulin for its simplicity once the issue has been fixed.

xd009642 commented 5 years ago

I know but unfortunately I'm still stuck on how to actually solve this, plus big job change and move has kept me busy irl :cry: I'll dedicate serious time to this this weekend and see if I can make any progress past my multiple not working experiments

jonhoo commented 5 years ago

@xd009642 It's okay, life happens, and your well-being is the most important! I'm sure if you gave some instructions for how others can help track this down, there are some people here who'd love to help :)

xd009642 commented 5 years ago

So you can't debug tarpaulin with gdb because ptrace can't trace ptrace using the --debug flag on tarpaulin and the dwarf dumps/disassembly from objdump on the test binary try to reconstruct the behaviour of the program and work out where there could be essentially threading conflicts (because binary is multithreaded but ptrace gives you a single threaded view into it).

Another route is working out how kcov and tarpaulin differ and seeing if the differences could be the root of the problem. The developer wiki ptrace section may be of help. And all the previous comments in this thread

xd009642 commented 5 years ago

So I thought one of the patches to get coz-rs working might solve this (increasing the signal stack as overflowing it causes segfaults). Didn't work but the process of trying it lead me to get some stats on how often this happens. I'm observing it 11% of runs with the minimum example from before:

#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}

Hopefully now with a test for this in one of my local branches and vague statistics on how often it occurs I'll be able to find when I'm on the right track easier! Current work is on the branch segfault-investigation

xd009642 commented 5 years ago

It didn't make an improvement at all... To try and aid in debugging for me and anyone who wants to take a crack of it I've gutted tarpaulin keeping only the ptrace and statemachine stuff and added some code to push events into a timeline of what happens and generate a timeline with gnuplot https://github.com/xd009642/minitarp it's still very much a WIP but here's some example output with the futures project in tests/data/futures in the branch segfault-investigation.

Currently, the most interesting thing to me is why some threads appear but I don't appear to get a ptrace clone event :confused:

output

xd009642 commented 5 years ago

X Axes ticks are now in minitarp's graphing, also put in more of the ptrace events so you now get the WaitStatus's I get in order, the data intepretted from that and the actions taken. Also, this blog post begins to hint at some of the difficulties faced with ptrace https://www.cyphar.com/blog/post/20160703-remainroot-ptrace-hell

output

Part of me is hoping this is approaching the level of info for someone smarter than me to tackle this for hacktoberfest :joy: but I'm gonna soldier on regardless.

EDIT: Something else I just noticed RIP 0x4043c0 is a line from test case b, the previous two points are the step and continue for test a. So I have test-threads set to 1 for the inferior test so it's going from test case a to b in the tracing before the threads for test-case A have been picked up

xd009642 commented 5 years ago

So I just got output that passes no segfaults for the first time! Still analysing results and figuring it out but here's a picture for interested parties of the run

output_pass

xd009642 commented 5 years ago

you can also see the changes and test them here https://github.com/xd009642/tarpaulin/pull/259 . This so far hasn't failed once for me and I've done 100 runs of the minimal example with 4 tests creating threadpools! This previously failed 100% of the time for me (2 tests failed 10%)

xd009642 commented 5 years ago

I'm now running tarpaulin on a project with 9 tests using thread pools 100 times. Once this is done if there are no failures I'll merge the branch into develop

xd009642 commented 5 years ago

And it passed! I'll keep it in develop until the 3rd and then I'll merge it, just to give anyone a chance to find issues. I'm very happy with this currently though I'm seeing tarpaulin do things it's never done before :smile:

elinorbgr commented 5 years ago

:tada:

Btw, do you plan to write some explanation of what the issue was? I (at least) would find this interesting to know. This is quite puzzling as it appears that the only significant change in #259 is the removal of RUST_TEST_THREADS=1 ... ?

xd009642 commented 5 years ago

@vberger I'll do a more thorough write up and consider posting it as a blog or just put it here. But this fix is two parts, the first is the queue of pending wait signals that I did a few months ago which fixed a big tarpaulin issue with multi-threading and is what allowed me to remove the test_threads=1 limitation. The test_threads 1 is based on how libtest has changed it's test runner and I'll also go into details on that :smile:

xd009642 commented 5 years ago

https://xd009642.github.io/2019/10/02/Tarpaulin-and-the-futures.html

jonhoo commented 5 years ago

@xd009642 Could you also publish a new Docker image? That way I can start taking advantage of this immediately! :D

xd009642 commented 5 years ago

It's on the develop tag, master coming in tonight since there's been no reported regressions

xd009642 commented 5 years ago

New version 0.9.0 is being released as well speak and docker image will be updated as part of that process once the really long travis build finishes!

It's been "fun" but it's now time to close this issue for good!

elinorbgr commented 5 years ago

I have some bad news : tarpaulin 0.9 segfaults on several of wayland-rs tests, including several test files that do not do any multithreading. I'll try to make some reasonably small example.

elinorbgr commented 5 years ago

Here is a repro:

use std::sync::Arc;

fn do_it() {
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
    let _a = Arc::new(0);
}

#[test]
fn test_1() {
    do_it()
}

#[test]
fn test_2() {
    do_it()
}

Running tarpaulin 0.9 on this regularly (30-40% of the time) results in:

Error: "Failed to get test coverage!
Error: Failed to run tests:
Error running test - SIGILL raised in 40531"

And sometimes also results on what appears to be a deadlock, or even the usual:

Error: "Failed to get test coverage!
Error: Failed to run tests: A segfault occurred while executing tests"

Increasing the number of Arc::new(...) in the body of the function seems to increase the chances of a segfault or SIGILL occurring.

clux commented 5 years ago

While there might be more issues with this ^, I just wanted to say thanks a lot for all the hard work on this release. This fixed all the issues I had with tarpaulin on a mid-sized project that had heavy use of rayon. It still crashes with --test-threads 1, but removing that flag actually worked :+1:

hntd187 commented 5 years ago

Although this is being reopened, my build now works beautifully with the new change and my project makes some pretty extensive use of the futures ecosystem. Amazing work

https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=354

xd009642 commented 5 years ago

@vberger I'm getting much lower incidence rates for the failure (2/1000 attempts failed) and I've got 25 calls to Arc::new and 4 tests calling do_it. There's a chance this one may be trickier

elinorbgr commented 5 years ago

Indeed, the reproducibility seems to depend on the computer running the test, on my other computer it is much harder to reproduce.

Still, I managed to reproduce it a few times with the do_it() function being simply

fn do_it() {
    for _ in 0..10000 {
        let _a = vec![0];
    }
}

and running it 8 times in parallel.

Given that now there is nothing especially multithreaded in these tests, I strongly suspect that tarpaulin somehow interferes with the behavior of the allocator. This is possibly the same issue cause as #264 ?

jonhoo commented 5 years ago

Hmm, this is interesting. In evmap, I now get:

cargo tarpaulin --features "" --out Xml
========================== Starting Command Output ===========================
/bin/bash --noprofile --norc /__w/_temp/57325ed4-7e46-4e9b-b76a-e249415e0eef.sh
[INFO tarpaulin] Running Tarpaulin
[INFO tarpaulin] Building project
[INFO tarpaulin] Launching test
[INFO tarpaulin] running /__w/1/s/target/debug/deps/lib-94fc5ea0bf28285e

running 23 tests
Error: "Failed to get test coverage! Error: Failed to run tests: Error running test - SIGILL raised in 369"

So no segfault any more, but an illegal instruction. That seems even weirder!

jonhoo commented 5 years ago

Inferno still seems to hit the segfaul though. This is all running with the latest Docker image.

xd009642 commented 5 years ago

So there's a chance the SIGILL and SIGSEGV might be the same issue just manifesting in slightly different ways. I've got an experiment currently running to see if I've made any progress or not. If I have I'll push the branch and ask people here to test it out on their own projects :eyes:

xd009642 commented 5 years ago

Ok so ran on rust-evmap as @jonhoo mentioned it previously. With the change in the branch affinity did 100 runs with no SIGILL or SIGSEGV. Running on the current latest tarpaulin on crates.io and I can already see some failures before finishing the 100 runs.

So yeah if anyone wants to try it out and report back that would be appreciated!

Edit results for the latest release. Failed 11/100 times!

xd009642 commented 5 years ago

Another 100 runs on evmap and no failures. Now trying inferno (will edit with some results however far it gets before I shutdown).

EDIT: There appear to be failing tests on inferno. Maybe something I'm missing dependency wise? This kinda messed up my failure rate script but when I did some runs without suppressing the output I didn't see any segfaults... @jonhoo you mind trying it out and letting me know? Otherwise when it gets to the weekend I'll merge as it doesn't have a detrimental effect.

jonhoo commented 5 years ago

@xd009642 I'm away at a conference, so may be hard for me to find the time, but my guess is that the test failures is due to git submodules. Just run git submodule update --init and the tests should work! Also, awesome work on this. I'm so excited to see tarpaulin working again :D

xd009642 commented 5 years ago

I'll try that tonight. And have fun at the conference 👍

xd009642 commented 5 years ago

Yeah that was it. And I'm doing a run of 100. With latest master tarpaulin it segfaulted every time I tried. So far this fix has worked every time (but that's just 3 times). I'll let it get through all 100 runs but if it passes that I'm merging and might do a release tonight :eyes:

Edit: Release tonight is coming, >1000 runs not a single segfault or sigill!

jonhoo commented 5 years ago

It appears to be working! Thank you!

xd009642 commented 5 years ago

Nice! So as this issue ended up resolving two different issues I'm closing it and if there's another segfault or sigill a new issue should be opened :tada:

xd009642 / tarpaulin

Instrumentation address clash errors and segfaults #190