softminus closed this issue 2 years ago
On the same 16-core machine, we get:
$ time ./qa/pull-tester/rpc-tests.py
[...]
ALL | True | 11394 s (accumulated)
Runtime: 3442 s
real 59m10.026s
user 614m8.082s
sys 7m22.759s
Adding a `-j16` improves this by 9 minutes:
$ time ./qa/pull-tester/rpc-tests.py -j16
[...]
ALL | True | 33838 s (accumulated)
Runtime: 2909 s
real 50m14.014s
user 646m28.322s
sys 9m21.876s
so even with naive parallelization there are some gains that aren't offset by the increased overhead (sys time increasing by two minutes).
adding timings for rpc-tests on a 224-core GCP machine:
sasha_work@massive-machine-test:~/zcash$ time ./qa/pull-tester/rpc-tests.py
[...]
ALL | False | 7684 s (accumulated)
Runtime: 2396 s
real 39m56.799s
user 2907m1.599s
sys 81m6.784s
giving us user/real = 74, and with `-j224`:
sasha_work@massive-machine-test:~/zcash$ time ./qa/pull-tester/rpc-tests.py -j224
[...]
ALL | True | 39939 s (accumulated)
Runtime: 1338 s
real 24m2.643s
user 2814m9.037s
sys 58m22.905s
we get user/real = 117, which is better but nowhere near 224.
Presumably here we are spending more time doing proofs (which parallelize automatically over multiple threads) rather than doing single-threaded work.
Yes; this is partly an artifact of their history. `src/test/test_bitcoin` is the Boost test runner inherited from Bitcoin, and has all of their tests. We added our own tests there at first, some of which created proofs, but most of our tests were built into `src/zcash-gtest`, which uses the GoogleTest framework we added as our own dependency. Ergo most of the tests that are internally parallelized are in `zcash-gtest`.
I suspect the solution here will be similar to #5497: we should group the internally-parallelized tests separately from the externally-parallelized tests. I expect the simplest way to achieve this would be to port all of our shielded-heavy tests from Boost to GoogleTest, and then keep `src/test/test_bitcoin` for Bitcoin-style single-threaded tests (using their parallelization strategy, which we can backport from upstream). After this, we probably want to break up `zcash-gtest` into parallelism-heavy and parallelism-light tests as well (which we can easily do without interfering with upstream backports).
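A grouping along those lines could be sketched roughly like this (hypothetical code, not anything existing in the repo): fan the single-threaded tests out across a pool, then run each internally-parallelized test alone, since its own worker threads will already saturate the cores:

```python
from concurrent.futures import ThreadPoolExecutor

def run_grouped(tests, run_one, workers=None):
    """tests: list of (name, internally_parallel) pairs.
    Phase 1: single-threaded tests share a worker pool.
    Phase 2: proof-heavy tests each get the machine to themselves."""
    light = [name for name, heavy in tests if not heavy]
    heavy = [name for name, heavy in tests if heavy]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, outcome in zip(light, pool.map(run_one, light)):
            results[name] = outcome
    for name in heavy:
        results[name] = run_one(name)
    return results
```

The `internally_parallel` flag would have to be maintained by hand (or inferred from timing data) per test.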
before i dive into doing this: do we have any ideas / extant code for effectively profiling these tests (or profiling `zcashd` RPC calls or `librustzcash` operations)?
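One cheap starting point (a sketch, not extant code): wrap each test invocation and diff `getrusage(RUSAGE_CHILDREN)` around it, which at least separates wall time from the user/sys CPU time the child process tree consumed:

```python
import resource, subprocess, sys, time

def profile_cmd(cmd):
    """Run one test command and report wall time plus the user/sys CPU
    time consumed by its whole (reaped) child process tree.
    RUSAGE_CHILDREN is cumulative, so we take a before/after diff."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {"wall": wall,
            "user": after.ru_utime - before.ru_utime,
            "sys": after.ru_stime - before.ru_stime}

stats = profile_cmd([sys.executable, "-c", "sum(x * x for x in range(10**6))"])
```

A high user/wall ratio for a single test would flag it as internally parallelized (proof-heavy); per-RPC or `librustzcash`-level profiling would need instrumentation inside the binaries themselves.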
Here are the requested timings from the bare-metal threadripper machine:
time ./zcash/qa/pull-tester/rpc-tests.py -j64
ALL | True | 50258 s (accumulated)
Runtime: 1585 s
real 27m53.943s
user 1214m56.712s
sys 17m51.716s
time ./zcash/src/test/test_bitcoin -p
Running 408 test cases...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
*** No errors detected
real 16m23.207s
user 203m13.138s
sys 2m35.408s
time ./zcash/qa/zcash/full_test_suite.py gtest
[----------] Global test environment tear-down
[==========] 243 tests from 41 test cases ran. (238845 ms total)
[ PASSED ] 243 tests.
YOU HAVE 1 DISABLED TEST
--------------------
Finished stage gtest
real 4m7.742s
user 66m51.733s
sys 0m45.578s
There are several options i see for handling this, though i think we need some numbers for how badly the contention affects the wall time of test execution, to determine how badly we need mitigations and how effective the mitigations are:
1. Non-adaptive test sharding (dividing all the tests into multiple test groups) and running different groups of tests on different machines, or one group after another on the same machine: [Add "test groups" to `rpc-tests.py` #5497](https://github.com/zcash/zcash/issues/5497#issue-1112892509)
2. Trickery with `nice` values / priority / cgroups / process groups / etc. to give one set of proof threads (to first order, i don't think it matters which) priority over the others
3. Adaptively disallowing more than one proof process from running at once: run different processes/tests in different cgroups; if we notice one taking up all cores for a while, freeze the others until that proof is done.
4. Throw a very-high-core-count VM at the problem (and make sure it gets spun down after the tests are done running, to avoid paying for idle instances)
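The `nice`-based option could look roughly like this (a sketch with hypothetical names, not existing code): launch one set of test processes with a raised nice value so their proof threads yield the CPU to the un-niced set:

```python
import os, subprocess, sys

def run_niced(cmd, niceness=10, **popen_kwargs):
    """Launch cmd at a lower scheduling priority.
    preexec_fn runs in the child between fork and exec, so only the
    child's nice value is raised."""
    return subprocess.Popen(
        cmd,
        preexec_fn=lambda: os.nice(niceness),
        **popen_kwargs,
    )

# the child can report its own nice value back to us:
proc = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"],
                 stdout=subprocess.PIPE)
out, _ = proc.communicate()
```

Note that `nice` only changes who wins contention; it doesn't stop two proof workloads from thrashing each other's caches.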
noting this for further reference: when we backport bitcoin core's btest parallelization thing (which seems highly worth it, given there'll be no internally-parallelized (proof) tests in btests quite soon), we'll probably need to make some changes to our `WalletTestingSetup`.
Currently, our `WalletTestingSetup` does:
pwalletMain = new CWallet(Params(), "wallet_test.dat");
and it seems like multiple instances are likely to trample over each other if they're all using that common file name.
Bitcoin Core currently seems to call out to `CreateMockWalletDatabase()`, which appears to allow creation of temporary in-memory databases (either sqlite or bdb): https://github.com/bitcoin/bitcoin/blob/267917f5632a99bb51fc3fe516d8308e79d31ed1/src/wallet/walletdb.cpp#L1189-L1196
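Short of backporting the mock-database machinery, the generic mitigation is to give each test process its own scratch path. Sketched here in Python (the fixture itself is C++; the names are illustrative):

```python
import os, tempfile

def make_wallet_path(prefix="wallet_test"):
    """Create a per-process scratch directory and return a wallet path
    inside it, so concurrent test runners never open the same
    wallet_test.dat."""
    scratch = tempfile.mkdtemp(prefix="%s_%d_" % (prefix, os.getpid()))
    return os.path.join(scratch, "wallet_test.dat")
```

The same idea in the C++ fixture would mean constructing `CWallet` with a unique per-test filename instead of the hardcoded `"wallet_test.dat"`.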
posting this as notes for btest parallelization (if anyone wants to chime in, that's appreciated but not requested):
so the parallelization scheme (https://github.com/bitcoin/bitcoin/blob/master/src/Makefile.test.include#L386-L388) that bitcoin core uses for their btests doesn't work with an unmodified zcash codebase: some of our test suites are `#if`'d out, but the grep still finds them, which causes boost's test runner code to error out. doing something silly like adding a keyword to disabled test suites and `grep -v`'ing doesn't work either, because it feeds an empty argument to the test runner, which also causes an error.
btest parallelization work in progress lives here: https://github.com/superbaud/zcash/commits/parallel-btests
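One workaround sketch for the grep problem (hypothetical and simplified; not the code in that branch): extract suite names with a small script that skips plain `#if 0` blocks and never emits an empty name, instead of a bare grep:

```python
import re

SUITE_RE = re.compile(r"BOOST_(?:AUTO|FIXTURE)_TEST_SUITE\(\s*([A-Za-z0-9_]*)")

def enabled_suites(source):
    """Collect Boost test suite names from one source file, skipping
    anything inside a literal `#if 0` ... `#endif` block and dropping
    empty matches (both break the grep-based Makefile approach)."""
    suites, depth, disabled_at = [], 0, None
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#if"):
            depth += 1
            if disabled_at is None and stripped.startswith("#if 0"):
                disabled_at = depth
        elif stripped.startswith("#endif"):
            if disabled_at == depth:
                disabled_at = None
            depth = max(depth - 1, 0)
        elif disabled_at is None:
            m = SUITE_RE.search(line)
            if m and m.group(1):  # never feed an empty name to the runner
                suites.append(m.group(1))
    return suites
```

This only handles the literal `#if 0` case, not arbitrary preprocessor conditions; running the real preprocessor (or asking the compiled binary for its test list) would be more robust.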
on a 16-core, otherwise idle GCP VM:
Total user time is (82*60+47.169) s, wall clock time is (20*60+26.580) s, which tells us that on average, our machine is grossly under-utilized: (82*60+47.169) / (16 * (20*60+26.580)) = 25.3%.
Likewise, for gtests on the same machine, run with `time ./src/zcash-gtest`. Doing the same computation: (41*60+19.423) / (16 * (6*60+18.060)) = 40.9%, which is significantly better but still has much room for improvement. Presumably here we are spending more time doing proofs (which parallelize automatically over multiple threads) rather than doing single-threaded work.
bitcoin-core runs their unit tests in parallel with `make` managing the parallelism (AFAICT `make` calls out to the unit test runner binary and tells it the name of a single test to run, and runs multiple of those at once when the `-j` option is given): https://github.com/bitcoin/bitcoin/pull/12926
@str4d raised the issue in #5497 that running multiple processes that are completing proofs on a single machine is likely to lead to performance degradation due to contention (since they'll each want to use all the cores for their threads). i don't know if we have code to quantify the amount of performance degradation due to contention.
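The utilization figures above are just (total CPU time) / (cores × wall time), which is easy to recompute from any `time` output:

```python
def utilization(user_s, real_s, cores):
    """Fraction of the machine's available core-seconds that the run
    actually consumed (user CPU time / (cores * wall time))."""
    return user_s / (cores * real_s)

# the two 16-core runs quoted above:
btest_util = utilization(82 * 60 + 47.169, 20 * 60 + 26.580, 16)  # ~0.253
gtest_util = utilization(41 * 60 + 19.423, 6 * 60 + 18.060, 16)   # ~0.410
```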