thedavidmeister closed this issue 5 years ago.
So the instrumentation clash warnings may not be errors as such: multiple lines of code may map to the same addresses, and the warning just flags occurrences of that, which may affect the accuracy of results. And #35 has been partially mitigated by making --no-count the default, so the breakpoint on a line is disabled after that line is hit.
I'm cloning the repo and having a look into what's happening and will report back if I find anything/solve it.
The segfault also happens to me and here I thought my CI was going insane. See https://drone.exqa.de/Cogitri/tmplgen/89/3/3
@thedavidmeister I haven't managed to run your tests yet; I'm hitting a compile failure I need to look into.
@Cogitri there appears to be a timeout on your test test_bad_env. I wonder if this might be leaving things in an odd state that causes the segfault in tests::test_crate_or_licensing. I'll need to look into it more.
hi, a bit more info based on my testing
there are 3 issues that started popping up:
instrumentation clash could be fixed by disabling thinlto
segfaults could be fixed by disabling incremental compilation
still not sure about the different threading behaviour yet
ah no, i'm wrong, segfaults still happening :(
looks like the threading issues are partly our fault
it's trying to unwrap a None of something stateful that needs to be initialised in our setup
i have no idea why that would work fine in regular cargo test but fail with tarpaulin
oh, and it worked with tarpaulin about a month ago because we ran all our CI through it
so something changed, but also we're handling that change ungracefully at our end
ok, i have a plan to fix our side of things
just need a hand with the segfaults... i'm not even sure where/how to debug those, they appear in different places on CI vs. local for example
I'm trying to add cargo-tarpaulin to the xi-editor project and I also received a segfault: https://travis-ci.com/xi-editor/xi-editor/jobs/169995236.
So I looked at tmplgen, being the smaller and simpler project, and the issue goes away when I replace rayon's par_iter with iter, making me think this might be another threading-based issue. This may be a fiddly one with a fair bit of investigation so I'll keep you all posted.
Also, knowing these sorts of issues, there is always the chance it may disappear after another nightly upgrade...
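To illustrate the swap (a hypothetical sketch, not tmplgen's actual code), here is the same work expressed with rayon's parallel iterator and with a plain sequential one:

use rayon::prelude::*;

// Runs the closure across rayon's worker thread pool.
fn lengths_parallel(names: &[String]) -> Vec<usize> {
    names.par_iter().map(|n| n.len()).collect()
}

// Single-threaded version; avoids the extra worker threads that appear to
// upset tarpaulin's tracing.
fn lengths_sequential(names: &[String]) -> Vec<usize> {
    names.iter().map(|n| n.len()).collect()
}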
I replace rayon's par_iter with iter making me think this might be another threading based issue
Ah, yes, makes sense timing wise. Thanks for figuring that out!
Thanks for looking into this @xd009642! In case you're looking for a minimal repro, the snippet below seems to be enough to cause tarpaulin to segfault.
#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn test() {
    futures::executor::ThreadPool::new();
}
@brunocodutra is that using futures-preview 0.3.0-alpha.12? Because I tried that minimal example and experienced no segfault on my system with the latest nightly.
0.3.0-alpha.12, yes.
Now that's interesting, this definitely triggers a segfault if I add it to reducer, in fact I reduced it from a segfault I first observed in brunocodutra/reducer#11. However I can indeed confirm it doesn't appear to cause any issues on an otherwise empty project.
I'll try a ground up approach this time, pulling in bits from reducer until I can reproduce it.
@xd009642 I think I got it, the issue seems to only manifest itself if there are at least two separate test cases spawning threads, so the snippet below triggers the segfault for me on an empty project, not on every execution, but very frequently.
#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}
EDIT: the issue also manifests itself when spawning std::thread threads directly, albeit less frequently, so I assume the number of threads spawned is an important factor.
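For reference, a hypothetical minimal variant of the repro above using plain std threads (the same shape, just without futures):

#[test]
pub fn a() {
    std::thread::spawn(|| ()).join().unwrap();
}

#[test]
pub fn b() {
    std::thread::spawn(|| ()).join().unwrap();
}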
For some more context, using the current 0.7.0 release, I have some of my test suite that segfaults when run on travis, but not on my local computer. Both travis and my computer are running rustc 1.33.0-nightly (daa53a52a 2019-01-17). The test file causing segfaults on travis spawns a thread with std::thread::spawn().
Same here. I only see the segfaults on travis.
I tried some bisecting on tarpaulin to see if this was a recently introduced issue; all I can say is that version 0.6.8 already triggers a segfault on @brunocodutra's example. (I couldn't try earlier versions because I can't simply upgrade them to cargo 0.32, which is required to compile the example using futures-preview.) All this was done on rustc 1.33.0-nightly (01f8e25b1 2019-01-24), but tarpaulin 0.7.0 also displays the issue when compiled on stable rustc 1.32.0 (9fda7c223 2019-01-16).
I have something more regarding the repro. I can reproduce it with the following code, which compiles on stable, using futures-executor = 0.2.1:
#[test]
pub fn a() {
    futures_executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures_executor::ThreadPool::new();
}
The interesting thing is that, if I run cargo tarpaulin (so using the stable compiler, rustc 1.32.0 (9fda7c223 2019-01-16)), I can't seem to trigger the segfault, while running cargo +nightly tarpaulin triggers the segfault roughly 1/4 of the time.
In both cases I use the cargo-tarpaulin binary built from stable Rust; I only change the compiler used to compile the tests.
Actually, the segfault also occurs when running cargo +beta tarpaulin, so on rustc 1.33.0-beta.5 (1045131c1 2019-01-31).
It looks like something was introduced in rustc 1.33.0 that breaks tarpaulin.
Perhaps you could do a docker image on rust:stable for the time being then, @xd009642?
There should currently be new images building for the stable and nightly compilers on the develop branch. Just waiting on dockerhub.
So as a slight update, I've added extra debug logging to help diagnose internal issues in tarpaulin, activated via the --debug flag (develop branch only for now). It can spew a ton of info, so if anyone wants to add logs to this issue, please attach a file if it's long or post a link to a gist.
For all the examples on this issue which reproduced it for me, occasionally I had a segfault and other times tarpaulin just ended up hanging and had to be killed. I've figured out what was causing it to hang! I was assuming anything that wasn't a trap or a segfault was an ignorable/forwardable signal. The times tarpaulin was hanging were because of a SIGILL (illegal instruction).
Hopefully, now I've got an area to focus in on, it shouldn't take me too long to resolve this!
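To make that distinction concrete, here is a rough hypothetical sketch of the kind of signal triage involved, using the nix crate's wait types (my guess at the shape, not tarpaulin's actual state machine):

use nix::sys::signal::Signal;
use nix::sys::wait::WaitStatus;

fn triage(status: WaitStatus) {
    match status {
        WaitStatus::Stopped(_, Signal::SIGTRAP) => {
            // breakpoint hit: record coverage, restore the instruction, resume
        }
        WaitStatus::Stopped(_, Signal::SIGSEGV) => {
            // fatal: report the segfault and stop collecting coverage
        }
        WaitStatus::Stopped(_, Signal::SIGILL) => {
            // illegal instruction, e.g. a thread resumed mid-patched-word;
            // treating this as just another forwardable signal caused the hangs
        }
        _ => {
            // other stops and events are forwarded or ignored
        }
    }
}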
So I was debugging this to try and figure out what compiler change exactly caused it. Here are some notes that might help you:
In @vberger's test case, there were 2 kinds of crashes I could find from the core dumps:
1.1. Crashes in ThreadPoolBuilder::create, where the problem was a truncated pointer. These crashes have this form:
0x7fffffffcb40: 0x007d2210 0x00000000 0x00000001 0x00000000
0x7fffffffcb50: 0xffffcff0 0x00000000 0xffffcff0 0x00000000
0x7fffffffcb60: 0xffffcf90 0x00007fff 0x00000000 0x00000000
0x7fffffffcb70: 0x007d0910 0x00000000 0x00000000 0x00000000
0x00000000004196ca <+90>: mov %rdi,0xb8(%rsp)
...
0x00000000004199c7 <+855>: mov 0xb8(%rsp),%rcx
=> 0x00000000004199cf <+863>: mov %rax,0x8(%rcx)
#0 0x00000000004199cf in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffcf90)
at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262
(gdb) x/8x $rcx
0xffffcff0: Cannot access memory at address 0xffffcff0
i.e., it is trying to write to %rcx, which comes from %rdi, which should contain the struct return address, which should be 0x7fffffffcff0, but instead it is writing to 0xffffcff0, i.e. the top 32 bits are truncated.
1.2. Stack misalignment crashes. These come from calling movaps when the stack is aligned to 8 (mod 16) bytes, with the misalignment appearing to come from tarpotest::b::{{closure}}.
The "correct" code for tarpotest::b appears to be this:
Dump of assembler code for function tarpotest::b::{{closure}}:
0x0000000000403860 <+0>: push %rax
0x0000000000403861 <+1>: mov %rdi,(%rsp)
0x0000000000403865 <+5>: callq 0x403420 <tarpotest::b>
0x000000000040386a <+10>: callq 0x403880 <test::assert_test_result>
0x000000000040386f <+15>: pop %rax
0x0000000000403870 <+16>: retq
Dump of assembler code for function tarpotest::b:
0x0000000000403420 <+0>: sub $0x18,%rsp
0x0000000000403424 <+4>: mov %rsp,%rdi
0x0000000000403427 <+7>: callq *0x3c7cdb(%rip) # 0x7cb108
=> 0x000000000040342d <+13>: mov %rsp,%rdi
0x0000000000403430 <+16>: callq 0x4036b0 <core::ptr::real_drop_in_place>
0x0000000000403435 <+21>: add $0x18,%rsp
0x0000000000403439 <+25>: retq
Note that tarpotest::b::{{closure}} appears later in the address space, but calls tarpotest::b. The breakpoint insertion code puts breakpoints at the start of both functions, resulting in this:
Dump of assembler code for function tarpotest::b:
0x0000000000403420 <+0>: sub $0x18,%rsp
0x0000000000403424 <+4>: int3
0x0000000000403425 <+5>: mov %esp,%edi
0x0000000000403427 <+7>: callq *0x3c7cdb(%rip) # 0x7cb108
=> 0x000000000040342d <+13>: mov %rsp,%rdi
0x0000000000403430 <+16>: callq 0x4036b0 <core::ptr::real_drop_in_place>
0x0000000000403435 <+21>: add $0x18,%rsp
0x0000000000403439 <+25>: retq
Dump of assembler code for function tarpotest::b::{{closure}}:
0x0000000000403860 <+0>: int3
0x0000000000403861 <+1>: mov %rdi,(%rsp)
0x0000000000403865 <+5>: callq 0x403420 <tarpotest::b>
0x000000000040386a <+10>: callq 0x403880 <test::assert_test_result>
0x000000000040386f <+15>: pop %rax
0x0000000000403870 <+16>: retq
tarpaulin should take the int3 trap, get to the tarpaulin code, then call the code with the original instruction restored. With println! calls added to log all the ptrace events, in the non-segfaulting runs I got a ptrace event for the breakpoint at 0x403424, but in the segfaulting runs I didn't. Looking at memory at 0x403424, the 0xcc byte was there, so the breakpoint should have been hit.
If this is indeed the case and execution somehow skips past the 0xcc, then the mov %esp,%edi would be executed, resulting in a stack pointer "truncated" to 32 bits. Similarly, if execution skips past the int3 at 0x403860, then push %rax wouldn't be executed, leaving the stack aligned to 8 (mod 16).
%rcx in the segfault case is already truncated when coming out of tarpotest::b, which increases my belief that the mov was done as a mov %esp,%edi.
Therefore, I suspect that there are some cases in which this code "skips" an int3 event, especially if it occurs quickly after another one (4 instructions afterwards!), and proceeds with execution of the program as if it wasn't there. I don't know ptrace well enough to know whether this can or can't happen.
So this may confirm some of my suspicions about the issue. Unfortunately, with ptrace, whenever you continue or step it sends the continue signal to every thread of the process. I would expect continuing a thread on an INT3 to just stay on the INT3, and I've assumed the same for step. However, if step can increment the PC past an INT3, that could cause these issues...
If this was the case I would have expected to see these spurious failures earlier in tarpaulin's life, so I'm still unsure about the cause. If that is the cause, though, it shouldn't be too hard to prove - in theory. I'm away from my main PC this weekend but I'll try something out when I get home on Sunday.
So today I think I came up with a solution to this problem and potentially #207 and #35. But this is going to be a reasonable change to the state machine, so it may not drop this week.
Essentially, instead of getting the latest signal, handling it, collecting coverage then resuming I'm going to get all available signals and do two passes of the list.
More technical explanation for the interested below
Essentially, if we have an instruction where the opcode is more than 1 byte, e.g. MOV, and we instrument it, we write the interrupt instruction to the start of the word. Therefore MOV r/m8,r8, which is 88 /r, could become CC /r.
Then when we get the interrupt, the program counter is at the /r, so we replace CC with the original instruction, move the program counter back 1 and then execute the original instruction with either a ptrace continue or a step.
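As a rough illustration of that patching (a hypothetical sketch in plain Rust, not tarpaulin's actual code, with the machine-code word simplified to a u64):

// Only the first byte of the instruction word is replaced with the int3
// opcode (0xCC), so the saved original word is needed to undo the patch
// when the trap fires.
const INT3_BYTE: u64 = 0xCC;

/// Returns (saved original word, word with its lowest byte replaced by 0xCC).
fn arm_breakpoint(original: u64) -> (u64, u64) {
    (original, (original & !0xFF) | INT3_BYTE)
}

/// After the SIGTRAP the program counter sits one byte past the 0xCC, i.e. in
/// the middle of the original instruction, so it must be moved back by one
/// before the restored instruction is re-executed via a ptrace step/continue.
fn rewind_pc(pc: u64) -> u64 {
    pc - 1
}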
However, there's a problem with this. If the interrupt is hit by two threads simultaneously then, the old way, I'd re-enable the instruction, move the first thread back and then use ptrace to continue. But ptrace continues and steps are applied to all threads in the process. So now we have a second thread that's in the middle of an instruction which then tries to execute an invalid instruction.
By calling waitpid until it stops giving me signals, however, I can build up a view of all the threads that have been trapped, work out when those traps would cause this issue, and solve it. In theory.
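A rough sketch of the "drain the queue first" part, assuming the nix crate (the function name and structure here are mine, not tarpaulin's):

// Collect every wait status currently pending before acting on any of them,
// so simultaneously-trapped threads are all known before a ptrace
// continue/step resumes the whole process.
use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};
use nix::unistd::Pid;

fn drain_wait_queue() -> nix::Result<Vec<WaitStatus>> {
    let mut pending = Vec::new();
    loop {
        // Pid -1 means "any child"; WNOHANG keeps the call non-blocking and
        // __WALL also reports clone()d threads.
        match waitpid(
            Pid::from_raw(-1),
            Some(WaitPidFlag::WNOHANG | WaitPidFlag::__WALL),
        )? {
            WaitStatus::StillAlive => break, // nothing else queued right now
            status => pending.push(status),
        }
    }
    Ok(pending)
}

The two passes over that list would then be: first fix up any thread stopped in the middle of a patched instruction, then record coverage and resume everything.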
This is still just a working theory that I'm in the process of coding up to test. If anyone sees any mistakes in my reasoning or has any additional ideas feel free to lend your 2 cents :+1:
Can't wait for the fix! In the meantime, I found a workaround: forcing execution onto a single core using taskset -c 0 cargo tarpaulin ... or restricting Docker to a single core using docker run --cpuset-cpus="0" ...
It's much slower obviously, but still better than having no coverage report.
Okay, the fix is in develop. I've tested it against some futures and rayon stuff that must have been mentioned above because it's in my issues-190 folder :+1: @Cogitri there appear to be some other issues with tmplgen; I'll figure out what they are and open another issue.
For everyone else, the fix is on the develop branch; please try it out and report back, and if it's not fixed for everyone I'll reopen the issue.
Great, thanks for looking into it! :)
I still see this on the most recent version from develop. You can look at a failed build here: https://dev.azure.com/dgriffen/nestor/_build/results?buildId=122.
@dgriffen so I've run tarpaulin on nestor about 10 times and haven't managed to recreate the issue. Could you add --debug to your tarpaulin call and include the debug logs?
I could reproduce it as well on the example described in https://github.com/xd009642/tarpaulin/issues/207#issuecomment-461208115 with the following logs:
[TRACE tarpaulin] Debug mode activated
[INFO tarpaulin] Running Tarpaulin
[INFO tarpaulin] Building project
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
[DEBUG tarpaulin] Processing tarpotest
[INFO tarpaulin] Launching test
[INFO tarpaulin] running /home/levans/test/tarpotest/target/debug/deps/tarpotest-c5aca069c10cd222
[TRACE tarpaulin] Source analysis for src/lib.rs
[TRACE tarpaulin] Ignorable lines: [1, 4, 5, 9]
[TRACE tarpaulin] Adding trace at address 0x403640 in src/lib.rs:2
[TRACE tarpaulin] Adding trace at address 0x403644 in src/lib.rs:3
[TRACE tarpaulin] Adding trace at address 0x403660 in src/lib.rs:7
[TRACE tarpaulin] Adding trace at address 0x403664 in src/lib.rs:8
[TRACE tarpaulin] Adding trace at address 0x403980 in src/lib.rs:7
[TRACE tarpaulin] Adding trace at address 0x403960 in src/lib.rs:2
[TRACE tarpaulin] Test PID is 16208
[TRACE tarpaulin] Caught inferior transitioning to Initialise state
[TRACE tarpaulin] Initialised inferior, transitioning to wait state
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403960
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403960
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403640
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403643
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403644
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403646
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16209
[TRACE tarpaulin] Result queue is [Stopped(Pid(16209), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16210
[TRACE tarpaulin] Result queue is [Stopped(Pid(16210), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16211
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16212
[TRACE tarpaulin] Result queue is [Stopped(Pid(16212), SIGSTOP)]
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP)]
[TRACE tarpaulin] Hit address 0x403980
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), Stopped(Pid(16211), SIGSTOP), PtraceEvent(Pid(16209), SIGTRAP, 6)]
[TRACE tarpaulin] Hit address 0x403980
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), PtraceEvent(Pid(16212), SIGTRAP, 6)]
[TRACE tarpaulin] Hit address 0x403660
[TRACE tarpaulin] Incrementing hit count for trace
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16211), SIGTRAP, 6)]
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGTRAP), Exited(Pid(16212), 0)]
[TRACE tarpaulin] Hit address 0x403663
[TRACE tarpaulin] Exited Pid(16212) parent Pid(16208)
[TRACE tarpaulin] Result queue is [Exited(Pid(16211), 0)]
[TRACE tarpaulin] Exited Pid(16211) parent Pid(16208)
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16210), SIGTRAP, 6)]
[TRACE tarpaulin] Child exiting
[TRACE tarpaulin] Result queue is [Exited(Pid(16209), 0)]
[TRACE tarpaulin] Exited Pid(16209) parent Pid(16208)
[TRACE tarpaulin] Result queue is [Exited(Pid(16210), 0)]
[TRACE tarpaulin] Exited Pid(16210) parent Pid(16208)
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16213
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16214
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16215
[TRACE tarpaulin] Result queue is [Stopped(Pid(16214), SIGSTOP)]
[TRACE tarpaulin] Result queue is [PtraceEvent(Pid(16208), SIGTRAP, 3)]
[TRACE tarpaulin] New thread spawned 16216
[TRACE tarpaulin] Result queue is [Stopped(Pid(16216), SIGSTOP)]
[TRACE tarpaulin] Result queue is [Stopped(Pid(16208), SIGSEGV)]
[TRACE tarpaulin] No action suggested to continue tracee. Attempting a continue
[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests: A segfault occurred while executing tests
@xd009642 sure. Just to make sure, did you run it on the PR branch that produces the issue?
Ah no I didn't. I got thrown by the master mentioned in the azure GUI, but that's just saying what the PR would be merging into.
I've reproduced with @vberger's example so I'll use that for development and then test on nestor to make sure that I can reproduce the issue and it's fixed :+1:
Hi @xd009642, all my test cases pass when executing "cargo test", but when executing "cargo tarpaulin -v" some of them give the error "Error a segfault occured when executing test". I am not able to infer the reason behind this error. Please help me out ASAP so that I can continue my work.
I just felt it was necessary to comment on this :)
Please help me out ASAP so that I can continue my work.
I'm sure this is being worked on as best as possible in the limited time everyone participating in this FOSS project has, so sit tight and wait for a fix to arrive please.
Also, next time you report a bug, please include more info: What code is this being run on? What's the complete log of tarpaulin in debug mode? What version of tarpaulin are you using, and so on.
Well, one pattern I've observed is that if I get a result queue of [Stopped(Pid(test_parent), SIGSTOP), PtraceEvent(Pid(child_thread), SIGTRAP, 6)] everything's good. If I get one that's just [PtraceEvent(Pid(child_thread), SIGTRAP, 6)] it consistently segfaults afterwards.
I'm still trying to figure out why that might be and how I can prevent it happening... But if anyone has any ideas from that I'm open to anything.
These threads were also all spawned in a previous test and are only now running during the next test; I don't think that matters, but maybe?
this is beyond what i've done in rust soz!
The current development docker image with taskset -c 0 cargo tarpaulin ... as mentioned by @appaquet is working for me. One test is failing with a segfault (not when running cargo test) but a report file is written.
@xd009642 Unfortunately, I'm still getting the crash with the latest 0.8.0. Here's a gist that crashes almost all the time on my machine. On my bigger project, I could get 0.7.0 to work by pinning the process to 1 core as I mentioned before, but it now crashes with 0.8.0 with or without pinning.
@xd009642 is this supposed to be fixed now?
should i be re-testing against my codebase(s)?
The changes fixed it for some people but not others.
I released it before everyone was sorted because I'd gone a while without a release and there were other features, and also I'm in the process of changing job and moving house so things are pretty hectic right now!
Didn't realise I had a commit comment that closed this issue though!
np, thanks!
i'm still getting segfaults
getting a branch set up that has debug output in the CI to reference (using tarpaulin 0.8.0 currently)
I think there are far more minimal examples above but FWIW, I'm also experiencing segfaults on 0.8.0. Previously, I'd been running into tests timing out.
Logs with --debug: https://travis-ci.org/apertium/apertium-stats-service/builds/535572221#L4883
@xd009642 I'm seeing these kinds of segfaults on https://github.com/HdrHistogram/HdrHistogram_rust/pull/82 as well. To reproduce, ignore all the tests in tests/sync.rs that start with mt_ (they take a while to run), replace every fn with fn sync_, and run with
$ cargo tarpaulin -v -- sync --test-threads=1
to only run the tests in that file (it'd be handy to be able to give tarpaulin -t like cargo test). It segfaults regularly, usually on either no_block_empty_sync or sync_record_through. no_block_empty_sync run in isolation does not appear to segfault, so I suspect the guess further up, that the issue arises when multiple tests spawn threads, is accurate.
This also happens on https://github.com/jonhoo/tracing-timing (see, for example: https://dev.azure.com/jonhoo/tracing-timing/_build/results?buildId=74&view=logs&s=0ffffa8c-a783-570b-868c-b44a3db2ace1)
I also experience segfaults both locally and on azure with my build: https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=189&view=logs
This also seems to happen in Quinn's coverage runs, now that the rustc bugs crashing the test coverage compilation there have finally been fixed.
@xd009642 I think this is a pretty high priority issue. It is the thing that's preventing me from using tarpaulin in nearly all my crates at the moment.
as seen in failing builds https://circleci.com/gh/holochain/holochain-rust/1514 using 0.6.11 and nightly-2018-12-26.
i've tried various flags passed to tarpaulin, but builds always seem to fail with this error. also seeing:
Error a segfault occured when executing test
potentially related https://github.com/xd009642/tarpaulin/issues/35 as we are using threads