WIP: RPC Test Intermittent Failures

zancas commented 4 years ago

Describe the issue

Several RPC tests intermittently fail including, at least:

wallet_listnotes.py
wallet_listreceived.py
sprout_sapling_migration.py

Can you reliably reproduce the issue?

If so, please list the steps to reproduce below:

Run the relevant test, for example: qa/pull-tester/rpc-tests.sh wallet_listnotes

This file, detect_race_run_log.txt, shows the result of a loop across ~45 iterations of WalletListNotes.
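A loop like the one that produced that log can be sketched as follows. This is a minimal sketch, not the script actually used: `count_failures` is a hypothetical helper, and it assumes the test command exits non-zero when a test fails.

```python
import subprocess


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return the indices of the runs that failed.

    Assumption: a failing test makes the command exit with a non-zero status.
    """
    failed = []
    for i in range(runs):
        # Capture output so a long loop does not flood the terminal.
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            failed.append(i)
    return failed
```

For the reproduction step above, the call would be something like `count_failures(["qa/pull-tester/rpc-tests.sh", "wallet_listnotes"], 45)`, run from the repository root.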

Expected behaviour

Each test should pass each time it is run.

Actual behaviour + errors

See the attached file above, where detect_race.py fails in ~5/44 runs. Note: detect_race.py is just wallet_listnotes.py with lines deleted.

The version of Zcash you were using:

Zcash Daemon version v2.1.0-1-1d2c747af

In order to ensure you are adequately protecting your privacy when using Zcash,
please see <https://z.cash/support/security/>.

Copyright (C) 2009-2019 The Bitcoin Core Developers
Copyright (C) 2015-2019 The Zcash Developers

This is experimental software.

Distributed under the MIT software license, see the accompanying file COPYING
or <https://www.opensource.org/licenses/mit-license.php>.

This product includes software developed by the OpenSSL Project for use in the
OpenSSL Toolkit <https://www.openssl.org/> and cryptographic software written
by Eric Young.

Machine specs:

OS name + version: Debian + bullseye/sid
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 16113544 kB
Disk size: 1 TeraByte
Disk Type (HDD/SSD): SSD
Linux kernel version (uname -a): Linux 7b91103205a5 5.4.6-arch1-1 #1 SMP PREEMPT Sat, 21 Dec 2019 16:34:41 +0000 x86_64 GNU/Linux

Compiler version (gcc --version):

gcc --version
gcc (Debian 9.2.1-8) 9.2.1 20190909
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Linker version (ld -v): GNU ld (GNU Binutils for Debian) 2.32.51.20190909

Assembler version (as --version):

GNU assembler (GNU Binutils for Debian) 2.32.51.20190909
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.

Any extra information that might be useful in the debugging process.

This development/build/test environment is a Docker container.

Do you have a backup of `~/.zcash` directory and/or take a VM snapshot?

Backing up / making a copy of the ~/.zcash directory might help make the problem reproducible. Please redact appropriately.
Taking a VM snapshot is really helpful for interactively testing fixes

zancas commented 4 years ago

My tests show an increase in intermittent failures in both Python3 ports, relative to the Python2 version (1% -> 10%; n = 100).
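To judge whether a jump like this is larger than run-to-run noise, one option is to put a confidence interval around each failure rate. The sketch below uses the Wilson score interval. It assumes n = 100 applies to each version separately (so 1 and 10 failures), which the numbers above suggest but do not state; `wilson_interval` is a name introduced here for illustration.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval for a failure rate; z=1.96 gives roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Assumed counts: 1/100 failures (Python2) and 10/100 failures (Python3 ports).
py2_low, py2_high = wilson_interval(1, 100)
py3_low, py3_high = wilson_interval(10, 100)
print("Python2: %.3f to %.3f" % (py2_low, py2_high))
print("Python3: %.3f to %.3f" % (py3_low, py3_high))
```

If the two intervals overlap heavily, more runs are needed before drawing conclusions about the port.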

A plausible explanation: The Racey Tests Hypothesis:

Logic in the test code detects races between components of a zcashd node.

Observations reported by @mdr0id: The observed failure rate drops when the node is run on a well-resourced system, and the test code contains constant wait times in polling loops.
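To illustrate why constant wait times matter: a loop that sleeps a fixed interval a fixed number of times gives up after the same wall-clock budget no matter how slow the machine is. A minimal sketch of a polling helper with an explicit deadline follows. `wait_until` is a hypothetical helper written for this comment, not a function taken from the test framework.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds have elapsed.

    Returns True if the condition was met, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the budget as a single parameter, it can be raised on slow or containerized hosts without touching the test logic. If the failure rate then drops, that is some evidence the failures are timeouts in the test code rather than wrong results from the node.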

zancas commented 4 years ago

Rather than publish experimental results now, I will focus on understanding the architecture of the state transitions that produce and consume operationids. I hypothesize that these transitions are asynchronous.

If anyone wishes to analyze or compare the data (perhaps because they are skeptical of the hypothesis that the failures are due to inconsistencies within zcashd nodes), I'm happy to curate and publish it. For now I assume that this model/hypothesis is understood and agreed upon.

mdr0id commented 4 years ago

It would be of value to also note if this was on the native host or within a container/VM environment.

zancas commented 4 years ago

From str4d via rocketchat:

Yes
async ops were introduced by us, because creating Sprout transactions was slow
They are only used for shielded transaction creation
They are not generic asynchronous logic within the node
It's solely a mechanism for implementing an asynchronous RPC method
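The pattern described in that quote, where an RPC call returns an operation id and the caller polls for its status, can be sketched as below. `FakeNode` is a hypothetical stand-in for a node's RPC proxy so the example runs on its own. The method name `z_getoperationstatus` and the status strings follow the zcashd RPC as I understand it, but treat the exact response shape as an assumption.

```python
import time


class FakeNode:
    """Hypothetical stub of an RPC proxy; reports 'executing' for a few polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def z_getoperationstatus(self, opids):
        # Assumed response shape: one dict per requested operation id.
        if self._remaining > 0:
            self._remaining -= 1
            return [{"id": opids[0], "status": "executing"}]
        return [{"id": opids[0], "status": "success"}]


def wait_for_operation(node, opid, timeout=60.0, interval=0.1):
    """Poll until the operation reaches a terminal status, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        results = node.z_getoperationstatus([opid])
        status = results[0]["status"] if results else None
        if status in ("success", "failed", "cancelled"):
            return status
        if time.monotonic() >= deadline:
            raise AssertionError("operation %s still %r after %.1fs" % (opid, status, timeout))
        time.sleep(interval)
```

In this model the only asynchronous step is the wait for a terminal status, so a test that polls with too small a budget can fail even though the node would have finished the operation correctly.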
