
speed up testing via parallelization/sharding #5501

Closed softminus closed 2 years ago

softminus commented 2 years ago

on a 16-core, otherwise idle GCP VM:

time ./src/test/test_bitcoin -p
Running 408 test cases...

*** No errors detected

real    20m26.580s
user    82m47.169s
sys     1m19.600s

Total user time is (82*60+47.169), wall clock time is (20*60+26.580), which tells us that on average, our machine is grossly under-utilized: (82*60+47.169) / (16 * (20*60+26.580)) = 25.3%.
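For reference, the utilization figure is just total user time divided by (cores × wall-clock time); a throwaway snippet to double-check the arithmetic:

```cpp
#include <cstdio>

int main()
{
    // Average CPU utilization = total user time / (cores * wall-clock time).
    const double user_s = 82 * 60 + 47.169;  // user: 82m47.169s
    const double real_s = 20 * 60 + 26.580;  // real: 20m26.580s
    const int cores = 16;
    std::printf("%.1f%%\n", 100.0 * user_s / (cores * real_s));  // prints 25.3%
    return 0;
}
```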

Likewise, for gtests on the same machine, run with time ./src/zcash-gtest:

real    6m18.060s
user    41m19.423s
sys     0m14.144s

Doing the same computation: (41*60 + 19.423) / (16*(6*60+18.060)) = 40.9%, which is significantly better but still has much room for improvement. Presumably here we are spending more time doing proofs (which parallelize automatically over multiple threads) rather than doing single-threaded work.

Bitcoin Core runs their unit tests in parallel, with make managing the parallelism (AFAICT, make calls out to the unit test runner binary and tells it the name of a single test to run, and runs multiple of those at once when the -j option is given): https://github.com/bitcoin/bitcoin/pull/12926

@str4d raised the issue in #5497 that running multiple processes that are computing proofs on a single machine is likely to lead to performance degradation due to contention (since they'll each want to use all the cores for their threads). I don't know if we have code to quantify how bad that degradation is.

There are several options I see for handling this, though I think we need some numbers for how badly contention affects the wall-time of test execution, both to determine how badly we need mitigations and to determine how effective the mitigations are:

1) Non-adaptive test sharding (dividing all the tests into multiple test groups) and running different groups of tests on different machines or one group after another on the same machine: https://github.com/zcash/zcash/issues/5497#issue-1112892509

2) Trickery with nice values / priority / cgroups / process groups / etc. to give one set of proof threads (to first order, I don't think it matters which) priority over the others

3) Adaptively disallowing more than one proof process from running at once: run different processes/tests in different cgroups; if we notice one taking up all cores for a while, freeze the others until that proof is done.

4) Throw a very high core-count VM at the problem (and make sure it gets spun down after the tests are done running, to avoid paying for idle instances)

softminus commented 2 years ago

On the same 16-core machine, we get:

$ time ./qa/pull-tester/rpc-tests.py
[...]
ALL                                | True   | 11394 s (accumulated)

Runtime: 3442 s

real    59m10.026s
user    614m8.082s
sys     7m22.759s

Adding a -j16 improves this by 9 minutes:

$ time ./qa/pull-tester/rpc-tests.py -j16
[...]
ALL                                | True   | 33838 s (accumulated)

Runtime: 2909 s

real    50m14.014s
user    646m28.322s
sys     9m21.876s

so even with naive parallelization there are real gains, and they aren't offset by the increased overhead (sys time increasing by two minutes).

softminus commented 2 years ago

adding timings for rpc-tests on a 224-core GCP machine:

sasha_work@massive-machine-test:~/zcash$ time ./qa/pull-tester/rpc-tests.py
[...]
ALL                                | False  | 7684 s (accumulated)

Runtime: 2396 s

real    39m56.799s
user    2907m1.599s
sys     81m6.784s

giving us user/real ≈ 73

and with -j224:

sasha_work@massive-machine-test:~/zcash$ time ./qa/pull-tester/rpc-tests.py -j224
[...]
ALL                                | True   | 39939 s (accumulated)

Runtime: 1338 s

real    24m2.643s
user    2814m9.037s
sys     58m22.905s

we get user/real ≈ 117, which is better but nowhere near 224.

str4d commented 2 years ago

> Presumably here we are spending more time doing proofs (which parallelize automatically over multiple threads) rather than doing single-threaded work.

Yes; this is partly an artifact of their history. src/test/test_bitcoin is the Boost test runner inherited from Bitcoin, and has all of their tests. We added our own tests there at first, some of which created proofs, but most of our tests were built into src/zcash-gtest, which uses the GoogleTest framework we added as our own dependency. Ergo most of the tests that are internally parallelized are in zcash-gtest.

str4d commented 2 years ago

I suspect the solution here will be similar to #5497: we should group the internally-parallelized tests separately from the externally-parallelized tests. I expect the simplest way to achieve this would be to port all of our shielded-heavy tests from Boost to GoogleTest, and then keep src/test/test_bitcoin for Bitcoin-style single-threaded tests (using their parallelization strategy which we can backport from upstream). After this, we probably want to break up zcash-gtest into parallelism-heavy and parallelism-light tests as well (which we can easily do without interfering with upstream backports).
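For a sense of the porting effort, the Boost.Test-to-GoogleTest mapping is mostly mechanical; a minimal sketch with made-up names (not an actual zcash test):

```cpp
// Minimal sketch of the Boost.Test -> GoogleTest mapping; the test and the
// function under test are made-up stand-ins, not real zcash code.
#include <gtest/gtest.h>

// Stand-in for whatever shielded-heavy logic the Boost test exercised.
static int SomeShieldedOperation() { return 42; }

// The Boost.Test original in src/test/ might look like:
//
//   BOOST_AUTO_TEST_CASE(example_shielded_check)
//   {
//       BOOST_CHECK_EQUAL(SomeShieldedOperation(), 42);
//   }

// The GoogleTest port in src/gtest/ would be:
TEST(ExampleShieldedTest, ExampleShieldedCheck)
{
    EXPECT_EQ(SomeShieldedOperation(), 42);
}
```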

softminus commented 2 years ago

Before I dive into doing this: do we have any ideas / extant code for effectively profiling these tests (or profiling zcashd RPC calls or librustzcash operations)?

mdr0id commented 2 years ago

Here are the requested timings from the bare-metal threadripper machine:

time ./zcash/qa/pull-tester/rpc-tests.py -j64

ALL                                | True   | 50258 s (accumulated)

Runtime: 1585 s

real    27m53.943s
user    1214m56.712s
sys 17m51.716s

time ./zcash/src/test/test_bitcoin -p
Running 408 test cases...

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

*** No errors detected

real    16m23.207s
user    203m13.138s
sys 2m35.408s

time ./zcash/qa/zcash/full_test_suite.py gtest

[----------] Global test environment tear-down
[==========] 243 tests from 41 test cases ran. (238845 ms total)
[  PASSED  ] 243 tests.

  YOU HAVE 1 DISABLED TEST

--------------------
Finished stage gtest

real    4m7.742s
user    66m51.733s
sys 0m45.578s
mdr0id commented 2 years ago

> There are several options I see for handling this, though I think we need some numbers for how badly contention affects the wall-time of test execution, both to determine how badly we need mitigations and to determine how effective the mitigations are:
>
> 1. Non-adaptive test sharding (dividing all the tests into multiple test groups) and running different groups of tests on different machines or one group after another on the same machine: [Add "test groups" to `rpc-tests.py` #5497 (comment)](https://github.com/zcash/zcash/issues/5497#issue-1112892509)
>
> 2. Trickery with `nice` values / priority / cgroups / process groups / etc. to give one set of proof threads (to first order, I don't think it matters which) priority over the others
>
> 3. Adaptively disallowing more than one proof process from running at once: run different processes/tests in different cgroups; if we notice one taking up all cores for a while, freeze the others until that proof is done.
>
> 4. Throw a very high core-count VM at the problem (and make sure it gets spun down after the tests are done running, to avoid paying for idle instances)

1. Should be handled by some CI system orchestrating the repo's scripts.
2. These might have some implementation issues across cloud resources, but definitely feasible.
3. These might have some implementation issues across cloud resources, but definitely feasible.
4. Requesting large machines on certain cloud providers is not always reliable, depending on a few factors. Additionally, we'd need to pay attention to the cost of these, since oftentimes they don't provide much speedup compared to bare-metal rigs at a certain price point.
softminus commented 2 years ago

Noting this for future reference: when we backport Bitcoin Core's btest parallelization (which seems highly worth it, given there'll be no internally-parallelized (proof) tests in the btests quite soon), we'll probably need to make some changes to our WalletTestingSetup.

Currently, our WalletTestingSetup does:

pwalletMain = new CWallet(Params(), "wallet_test.dat");

and it seems like multiple instances are likely to trample over each other if they're using that common file name.

Bitcoin Core currently seems to call out to CreateMockWalletDatabase(), which appears to allow creating temporary in-memory databases (either SQLite or BDB): https://github.com/bitcoin/bitcoin/blob/267917f5632a99bb51fc3fe516d8308e79d31ed1/src/wallet/walletdb.cpp#L1189-L1196
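Until we port something along those lines, one stopgap might be to give each test setup its own wallet filename; a rough sketch, where the helper below is hypothetical rather than existing code:

```cpp
// Hypothetical sketch: derive a unique wallet filename per process (and per
// setup within a process) so parallel btest invocations don't all open
// "wallet_test.dat".
#include <atomic>
#include <string>
#include <unistd.h>  // getpid()

static std::string UniqueWalletFilename()
{
    static std::atomic<int> counter{0};
    return "wallet_test_" + std::to_string(getpid()) + "_" +
           std::to_string(counter.fetch_add(1)) + ".dat";
}

// Then WalletTestingSetup would do something like:
//     pwalletMain = new CWallet(Params(), UniqueWalletFilename());
// instead of the fixed "wallet_test.dat".
```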

softminus commented 2 years ago

posting this as notes for btest parallelization (if anyone wants to chime in, that's appreciated but not requested):

so the parallelization scheme (https://github.com/bitcoin/bitcoin/blob/master/src/Makefile.test.include#L386-L388) that Bitcoin Core uses for its btests doesn't work with an unmodified zcash codebase, for two reasons (sketched below):

  1. we have test suites that are #if'd out, but the grep still finds them, which causes Boost's test runner to error out; doing something silly like adding a keyword to the disabled test suites and grep -v'ing them out doesn't work either, because it feeds an empty argument to the test runner, which also causes an error
  2. it also doesn't seem to work when there's more than one test suite in a single file; our current version of the Boost tests only runs the first suite
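For concreteness, the two patterns that trip up the grep-based discovery look roughly like this (illustrative suite names, not actual zcash test files; a file like this would be compiled into the existing test_bitcoin binary):

```cpp
#include <boost/test/unit_test.hpp>

// Pattern 1: a suite that is compiled out, but a grep over the sources still
// finds its name and asks the runner to execute a suite it doesn't contain.
#if 0
BOOST_AUTO_TEST_SUITE(disabled_suite)
BOOST_AUTO_TEST_CASE(never_built) { BOOST_CHECK(true); }
BOOST_AUTO_TEST_SUITE_END()
#endif

// Pattern 2: two suites in one file; a scheme that derives one suite name per
// file ends up running only the first suite and silently skipping the second.
BOOST_AUTO_TEST_SUITE(first_suite)
BOOST_AUTO_TEST_CASE(case_a) { BOOST_CHECK(true); }
BOOST_AUTO_TEST_SUITE_END()

BOOST_AUTO_TEST_SUITE(second_suite)
BOOST_AUTO_TEST_CASE(case_b) { BOOST_CHECK(true); }
BOOST_AUTO_TEST_SUITE_END()
```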
softminus commented 2 years ago

btest parallelization work in progress lives here: https://github.com/superbaud/zcash/commits/parallel-btests