Calloop + async-process leads to hours-long test times

detly commented 2 years ago

I've written a crate calloop-subproc that uses the Calloop event loop and async-process to manage subprocesses in an event loop. It's about 700 lines of code with 9 integration tests. In order to run the tests, the test binary has to build another test binary that can serve as the "thing to run as a subprocess" part of the tests.

Without tarpaulin, the tests take about 5s including the extra binary build. Under tarpaulin, they never finish. They don't seem to hang, however — I can gradually let more and more tests finish by increasing the timeout from 1m to 2m, to 10m, to an hour. So far, however, I have not found a long enough duration for the whole set of (9) tests to finish.

I have tried to create a minimal example, but without much luck. It really seems to be a result of the following factors used together:

the Calloop event loop
the async-process crate
test-time building

...and possibly other factors?

Nonetheless, my tarp-test project sort-of shows the problem, where a usually-20s test runs in 230s under tarpaulin (including the one-off build time). It certainly doesn't show the same magnitude of slowdown that the original project shows though.

I initially thought it was due to signals for subprocess management (kill, sigchld etc.) being blocked, but the tests themselves seem to actually work, so I doubt it's that after all.

I'll keep trying to see if I can identify a particular factor that's having such a disproportionate impact.

detly commented 2 years ago

I've narrowed it down a bit and found that while some tests (in calloop-subproc) complete in 200s, others take more than an hour (I still haven't found how long):

200  test_chain_single_failure
200  test_chain_single_success
?    test_chain_multiple_success_with_failure_and_cleanup
?    test_chain_multiple_success_fail
?    test_chain_multiple_success_with_failure_and_cleanup_failure
?    test_chain_multiple_success_with_cleanup_failure
?    test_chain_multiple_success
202  test_listen_one_line
199  test_listen_three_lines

xd009642 commented 2 years ago

So, here is a chance the current PR I'm working on will greatly improve this https://github.com/xd009642/tarpaulin/pull/962 if you want to try it out. It changes that part of the code a lot and right now I'm just fighting with one intermittently failing test on nightly. If you're interested in some of the why then this comment explains a lot of the details https://github.com/xd009642/tarpaulin/issues/953#issuecomment-1059982528

Once I've got this PR done and merged I'll look at this issue and see what the difference is (or maybe before, just once I've got the flaky test sorted). And hopefully come back to this issue with some updates

detly commented 2 years ago

Okay, I managed to substantially reduce the repro, it seems like having multiple async process spawns, even if they're sequential, just totally stops under tarpaulin. I get this output from the test repo linked above:

tarp-test ⚬ cargo tarpaulin --ignore-tests --timeout 120 --forward -- --nocapture
Mar 08 17:35:49.125  INFO cargo_tarpaulin::config: Creating config
Mar 08 17:35:49.151  INFO cargo_tarpaulin: Running Tarpaulin
Mar 08 17:35:49.151  INFO cargo_tarpaulin: Building project
Mar 08 17:35:49.151  INFO cargo_tarpaulin::cargo: Cleaning project
   Compiling [etc etc]
    Finished test [unoptimized + debuginfo] target(s) in 27.73s
Mar 08 17:36:16.969  INFO cargo_tarpaulin::process_handling::linux: Launching test
Mar 08 17:36:16.969  INFO cargo_tarpaulin::process_handling: running /home/jason/Code/tarp-test/target/debug/deps/test_one-036ad9d4be9c1c80

running 1 test
Spawning first process
Awaiting first process
test test_one has been running for over 60 seconds
Mar 08 17:38:21.793 ERROR cargo_tarpaulin: Failed to get test coverage! Error: Failed to run tests: Error: Timed out waiting for test response
Error: "Failed to get test coverage! Error: Failed to run tests: Error: Timed out waiting for test response"

detly commented 2 years ago

Oops, crossed signals there. I'll give that branch a try.

detly commented 2 years ago

Didn't seem to change things I'm afraid. I'm using 9543cb27292c24b1290ce5c61f419663eede2b52 and get:

tarp-test ⚬ ../tarpaulin/target/release/cargo-tarpaulin tarpaulin --ignore-tests --forward --timeout 120 -- --nocapture
Mar 08 18:17:08.440  INFO cargo_tarpaulin::config: Creating config
Mar 08 18:17:08.547  INFO cargo_tarpaulin: Running Tarpaulin
Mar 08 18:17:08.547  INFO cargo_tarpaulin: Building project
Mar 08 18:17:08.547  INFO cargo_tarpaulin::cargo: Cleaning project
   Compiling libc v0.2.119
   Compiling futures-core v0.3.21
   Compiling log v0.4.14
   Compiling cfg-if v1.0.0
   Compiling memchr v2.4.1
   Compiling autocfg v1.1.0
   Compiling pin-project-lite v0.2.8
   Compiling futures-task v0.3.21
   Compiling slab v0.4.5
   Compiling signal-hook v0.3.13
   Compiling cache-padded v1.2.0
   Compiling waker-fn v1.1.0
   Compiling futures-util v0.3.21
   Compiling parking v2.0.0
   Compiling fastrand v1.7.0
   Compiling futures-io v0.3.21
   Compiling pin-utils v0.1.0
   Compiling once_cell v1.10.0
   Compiling bitflags v1.3.2
   Compiling event-listener v2.5.2
   Compiling concurrent-queue v1.2.2
   Compiling memoffset v0.6.5
   Compiling futures-lite v1.12.0
   Compiling socket2 v0.4.4
   Compiling signal-hook-registry v1.4.0
   Compiling polling v2.2.0
   Compiling nix v0.22.3
   Compiling async-io v1.6.0
   Compiling async-process v1.3.0
   Compiling calloop v0.9.3
   Compiling tarp-test v0.1.0 (/home/jason/Code/tarp-test)
    Finished test [unoptimized + debuginfo] target(s) in 25.74s
Mar 08 18:17:34.565  INFO cargo_tarpaulin::process_handling::linux: Launching test
Mar 08 18:17:34.565  INFO cargo_tarpaulin::process_handling: running /home/jason/Code/tarp-test/target/debug/deps/test_one-036ad9d4be9c1c80

running 1 test
Spawning first process
Awaiting first process
test test_one has been running for over 60 seconds
Mar 08 18:19:37.765 ERROR cargo_tarpaulin: Failed to get test coverage! Error: Failed to run tests: Error: Timed out waiting for test response
Error: "Failed to get test coverage! Error: Failed to run tests: Error: Timed out waiting for test response"

xd009642 commented 1 year ago

So this should now be a lot better with --engine llvm for llvm coverage instrumentation which was released during 2022. If not feel free to reopen and I'll look again

detly commented 1 year ago

Just confirming, all I had to do was add that flag and it works perfectly in my example project.

xd009642 / tarpaulin

Calloop + async-process leads to hours-long test times #968