zcash / zcash

Zcash - Internet Money
https://z.cash/

WIP: RPC Test Intermittent Failures #4276

Open zancas opened 4 years ago

zancas commented 4 years ago

Describe the issue

Several RPC tests fail intermittently, including at least:

Can you reliably reproduce the issue?

If so, please list the steps to reproduce below:

  1. Run the relevant test, for example: qa/pull-tester/rpc-tests.sh wallet_listnotes

The attached file, detect_race_run_log.txt, shows the results of a loop over roughly 45 iterations of WalletListNotes.
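A repeat-and-count loop of this kind can be sketched as follows. The script that actually produced the log is not shown in this issue, so the function name `count_failures` and its parameters are assumptions made for illustration.

```python
import subprocess


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    Hypothetical helper: it treats any non-zero exit status as a test failure.
    """
    failures = 0
    for _ in range(runs):
        # Output is discarded here; redirect it to a file instead to keep a log.
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling `count_failures(["qa/pull-tester/rpc-tests.sh", "wallet_listnotes"], 45)` from the repository root would then give the number of failing runs out of 45.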

Expected behaviour

Each test should pass each time it is run.

Actual behaviour + errors

See the attached file above, where detect_race.py fails in roughly 5 of 44 runs. Note: detect_race.py is just wallet_listnotes.py with lines deleted.

The version of Zcash you were using:

Zcash Daemon version v2.1.0-1-1d2c747af

In order to ensure you are adequately protecting your privacy when using Zcash,
please see <https://z.cash/support/security/>.

Copyright (C) 2009-2019 The Bitcoin Core Developers
Copyright (C) 2015-2019 The Zcash Developers

This is experimental software.

Distributed under the MIT software license, see the accompanying file COPYING
or <https://www.opensource.org/licenses/mit-license.php>.

This product includes software developed by the OpenSSL Project for use in the
OpenSSL Toolkit <https://www.openssl.org/> and cryptographic software written
by Eric Young.

Machine specs:

Any extra information that might be useful in the debugging process.

This development/build/test environment is a Docker container.

Do you have a backup of ~/.zcash directory and/or take a VM snapshot?

zancas commented 4 years ago

My tests show an increase in intermittent failures in both Python3 ports, relative to the Python2 version (1% -> 10%; n = 100).

A plausible explanation: The Racey Tests Hypothesis:

Observations reported by @mdr0id: The observed failure rate drops when the node is run on well-resourced systems, and the test code contains constant wait times in polling loops.
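One common way to make a polling loop less sensitive to machine speed is to bound it by a wall-clock deadline rather than a fixed number of fixed-length sleeps. The sketch below is not taken from the Zcash test framework; `wait_until` and its parameters are hypothetical names used only to illustrate the idea.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    Hypothetical helper, not part of the Zcash test framework.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A slow container then only takes longer to pass instead of failing, as long as the condition becomes true before the deadline.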

zancas commented 4 years ago

Rather than publish experimental results now, I will focus on understanding the architecture of the state transitions that produce and consume operationids. I hypothesize that these transitions are asynchronous.

If anyone wishes to analyze or compare data (perhaps because they are skeptical of the hypothesis that the failures are due to inconsistencies within zcashd nodes), I'm happy to curate and publish it. For now I assume that this model/hypothesis is understood and agreed upon.

mdr0id commented 4 years ago

It would be of value to also note if this was on the native host or within a container/VM environment.

zancas commented 4 years ago

From str4d via rocketchat:

Yes
async ops were introduced by us, because creating Sprout transactions was slow
They are only used for shielded transaction creation
They are not generic asynchronous logic within the node
It's solely a mechanism for implementing an asynchronous RPC method
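The pattern described here (an RPC call returns an operation id immediately, and the caller polls for completion) can be sketched with a stub. `FakeNode` is a hypothetical stand-in, not the real RPC proxy; its method names are modeled on zcashd's `z_sendmany` and `z_getoperationstatus`, and the status strings are assumptions about the values those calls report.

```python
import time


class FakeNode:
    """Hypothetical stub of a node that completes an async op after N polls."""

    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done

    def z_sendmany(self, from_addr, recipients):
        # Returns at once with an operation id; the work happens in the background.
        return "opid-fake-0001"

    def z_getoperationstatus(self, opids):
        self._remaining -= 1
        status = "success" if self._remaining <= 0 else "executing"
        return [{"id": opid, "status": status} for opid in opids]


def wait_for_operation(node, opid, timeout=30.0, interval=0.1):
    """Poll until the operation reaches a final state; return that state."""
    final_states = ("success", "failed", "cancelled")  # assumed status values
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        results = node.z_getoperationstatus([opid])
        if results and results[0]["status"] in final_states:
            return results[0]["status"]
        time.sleep(interval)
    raise TimeoutError("operation %s did not finish in %.1fs" % (opid, timeout))
```

A test that reads wallet state before `wait_for_operation` returns would be racing the background operation, which is consistent with the hypothesis discussed above.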