zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Unit test fragility on qemu_cortex_a9 involving net management events, conditional on presence of other platforms #73102

Open glarsennordic opened 3 months ago

glarsennordic commented 3 months ago

Describe the bug

The following twister command fails approximately 3% of the time when executed on my machine (detailed below):

./scripts/twister --ninja --inline-logs --overflow-as-errors -T tests/net/conn_mgr_monitor

When it fails, the platform which failed is always qemu_cortex_a9.

Bafflingly, if we restrict the twister command to just this platform, the 3% error rate disappears:

./scripts/twister --ninja --inline-logs --overflow-as-errors -p qemu_cortex_a9 -T tests/net/conn_mgr_monitor

The error experienced 3% of the time is usually due to either too few or too many NET Management events being generated.

ERROR   - qemu_cortex_a9            tests/net/conn_mgr_monitor/net.conn_mgr.dad         FAILED : Failed
INFO    - /home/gela/ncs/zcmv46/zephyr/twister-out/qemu_cortex_a9/tests/net/conn_mgr_monitor/net.conn_mgr.dad/handler.log
ERROR   - *** Booting Zephyr OS build zephyr-v3.5.0-4816-g404db20877d9 ***
Running TESTSUITE conn_mgr_monitor
===================================================================
START - test_DAD
 PASS - test_DAD in 0.121 seconds
===================================================================
START - test_cycle_ready_CC
 PASS - test_cycle_ready_CC in 0.013 seconds
===================================================================
START - test_cycle_ready_CIC
 PASS - test_cycle_ready_CIC in 0.026 seconds
===================================================================
START - test_cycle_ready_CINC
 PASS - test_cycle_ready_CINC in 0.026 seconds
===================================================================
START - test_cycle_ready_CNC
 PASS - test_cycle_ready_CNC in 0.012 seconds
===================================================================
START - test_cycle_ready_NCC
 PASS - test_cycle_ready_NCC in 0.012 seconds
===================================================================
START - test_cycle_ready_NCIC
 PASS - test_cycle_ready_NCIC in 0.026 seconds
===================================================================
START - test_cycle_ready_NCINC
 PASS - test_cycle_ready_NCINC in 0.026 seconds
===================================================================
START - test_cycle_ready_NCNC
 PASS - test_cycle_ready_NCNC in 0.012 seconds
===================================================================
START - test_cycle_states_connected_ipv46
 PASS - test_cycle_states_connected_ipv46 in 0.244 seconds
===================================================================
START - test_cycle_states_connected_ipv64

    Assertion failed at WEST_TOPDIR/zephyr/tests/net/conn_mgr_monitor/src/main.c:521: cycle_iface_states: (stats.dconn_count not equal to 1)
NET_EVENT_L4_DISCONNECTED should be fired when connectivity is lost.
 FAIL - test_cycle_states_connected_ipv64 in 0.130 seconds
===================================================================
START - test_cycle_states_simple_ipv46
 PASS - test_cycle_states_simple_ipv46 in 0.254 seconds
===================================================================
START - test_cycle_states_simple_ipv64
 PASS - test_cycle_states_simple_ipv64 in 0.244 seconds
===================================================================
START - test_ignore_while_ready
 PASS - test_ignore_while_ready in 0.010 seconds
===================================================================
START - test_ignores
 PASS - test_ignores in 0.003 seconds
===================================================================
TESTSUITE conn_mgr_monitor failed.

(NOTE: The logs here show an older Zephyr commit hash than the latest. They were copied from runs where I tried older commits, but the problem appears to affect every Zephyr commit since I introduced this test suite.)

This strongly suggests some kind of bug with our QEMU simulation environment, but to be frank I'm at a loss as to what that could possibly be. I've tried maximizing the delays between event triggers in these tests and the event verifiers to give events maximal chances of settling, but to no avail.

I cannot fathom why, 3% of the time, unexpected events get triggered, or events which are expected are not triggered, regardless of delay, but ONLY if I also execute tests for other platforms. I suspect that network state from prior QEMU test executions might be affecting the initial network state for qemu_cortex_a9.
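One way to narrow down which other platform matters, as a rough sketch only: pair qemu_cortex_a9 with one additional platform at a time, relying on twister accepting -p more than once. The helper name print_pair_commands and the candidate platform names in the usage line are placeholders I made up for illustration, not platforms I have confirmed are involved.

```shell
# print_pair_commands PLATFORM...
# Prints one twister command per candidate platform. Each command tests
# qemu_cortex_a9 together with exactly one other platform.
# The candidate list is supplied by the caller; nothing here is known
# to trigger the failure.
print_pair_commands() (
    for other in "$@"; do
        printf '%s\n' "./scripts/twister --ninja --inline-logs --overflow-as-errors -p qemu_cortex_a9 -p $other -T tests/net/conn_mgr_monitor"
    done
)
```

For example, `print_pair_commands qemu_x86 qemu_cortex_m3` prints two commands. Each would still need to be repeated many times, since the failure only shows up in roughly 3% of runs.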

To Reproduce

Clone and west update the latest Zephyr, or use 76559f27fd6e9219516c9ee7deebbdf5b3116105 for my exact environment.

From the zephyr root directory, execute the following command (on Linux):

for i in {1..100}; do rm -r twister-out*; ./scripts/twister --ninja --inline-logs --overflow-as-errors -T tests/net/conn_mgr_monitor; done
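The loop above only shows failures as they scroll past. To get an actual failure count, something like the following sketch could be used. It assumes twister exits non-zero when any test fails and writes its results to twister-out in the current directory. The function name count_failures and the log directory layout are my own choices for illustration.

```shell
# count_failures RUNS LOGDIR CMD [ARGS...]
# Runs CMD the given number of times, keeps one log file per run in
# LOGDIR, and prints how many runs exited non-zero.
count_failures() (
    runs=$1
    logdir=$2
    shift 2
    failures=0
    i=1
    mkdir -p "$logdir"
    while [ "$i" -le "$runs" ]; do
        if ! "$@" >"$logdir/run-$i.log" 2>&1; then
            failures=$((failures + 1))
            # Keep the output of a failing run so handler.log can be inspected later.
            if [ -d twister-out ]; then
                mv twister-out "$logdir/twister-out-$i"
            fi
        fi
        # Start the next iteration from a clean output directory.
        rm -rf twister-out*
        i=$((i + 1))
    done
    echo "$failures"
)
```

Usage from the zephyr root directory would be `count_failures 100 flaky-logs ./scripts/twister --ninja --inline-logs --overflow-as-errors -T tests/net/conn_mgr_monitor`, which prints the number of failed runs out of 100.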

There is a:

Expected behavior

I would expect this test to succeed 100% of the time, instead of 97% of the time. I would also expect the result on qemu_cortex_a9 not to depend on whether other platforms are tested in the same run.

Impact

Largely, this is an annoyance. But I find the inconsistency in how qemu_cortex_a9 behaves somewhat concerning. It suggests something might be wrong with QEMU.

Environment:

System:
  Kernel: 6.5.0-28-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 42.9
    Distro: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Machine:
  Type: Laptop System: Dell product: Latitude 5400 v: N/A
  Mobo: Dell model: 0PD9KD v: A00 UEFI: Dell v: 1.9.1 date: 07/06/2020
CPU:
  Info: quad core model: Intel Core i7-8665U bits: 64 type: MT MCP

I am using Zephyr SDK 0.16.5 (zephyr-sdk-0.16.5-1_linux-x86_64.tar.xz)

github-actions[bot] commented 1 month ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.