zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

bsim mesh establish_multi.sh doesn't send data for one of devices #48514

Closed alxelax closed 2 years ago

alxelax commented 2 years ago

Describe the bug The bug is reproduce only if extended advertiser is chosen for mesh (the legacy adv works well). Test scenario includes 6 emulated devices: 1 mesh friend node and 5 mesh low power nodes. Test should start all devices simultaneously. 5 low power nodes should establish friendship with friend node in one turn. Wrong external behavior: 4th node always cannot establish friendship. It doesn't matter in which sequence emulated low power nodes started their Friend request frames. 4th node may send Friend Request the first one, in the middle or the last one (the randomness is provided by babblesim engine) but result is always the same all emulated nodes can establish friendship but the 4th node cannot.

Test passes only if there are 3 low power nodes if there are 4 or 5 nodes the 4th node is always failed.

if verbosity level is set at 7 (maximum?) the babblesim engine shows that all emulated low power nodes requested sending frame to emulated radio and got access to it. Even more, 4th node behaves itself as another nodes regarding reported statuses from Bluetooth controller and reports successful frame sending. The friend node doesn't see this frame.

If one more additional emulated device that reproduces BLE scanner to run in parallel it cannot see 4th node frames either. Seems 4th device actually doesn't emit frame in emulated radio channel.

The same issue is reproduced on two different PCs (no influence of PC performance).

Preliminary opinion: It seems babblesim engine has bug during frame sending in emulated radio channel. Babblesim model and modem are default ones.

To Reproduce Steps to reproduce the behavior:

  1. checkout Zephyr main
  2. set option CONFIG_BT_EXT_ADV=y in https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/bluetooth/bsim_bt/bsim_test_mesh/prj.conf
  3. build babble sim mesh tests (it is possible to build only this: https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/bluetooth/bsim_bt/compile.sh#L56)
  4. run full set bsim mesh tests or only establish_multi.sh scenario
  5. See error in shown log. Issue looks like the 4th device cannot establish friendship due to absence relevant offer from friend node.

Expected behavior establish_multi.sh should pass for both advertisers extended and legacy ones.

Impact Not possible to make extended advertiser as default one and pass Zephyr's CI. The legacy advertiser has very poor performance from mesh relay feature point of view.

Logs and console output ``` alkh-local@alkh-linux:~/ncs/zephyr/tests/bluetooth/bsim_bt/bsim_test_mesh/tests_scripts/friendship$ bash establish_multi.sh Starting friendship_friend_est_multi as device #0 Starting friendship_lpn_est as device #1 Starting friendship_lpn_est as device #2 Starting friendship_lpn_est as device #3 Starting friendship_lpn_est as device #4 Starting friendship_lpn_est as device #5 Starting phy with 6 devices d_00: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_01: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_02: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_03: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_04: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_05: @00:00:00.000000 Booting Zephyr OS build zephyr-v3.1.0-2520-g702b6843669d d_00: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_00: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_00: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_00: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_01: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_01: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_01: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_01: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_00: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: C5:AD:ED:38:00:B2 (random) d_00: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_00: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_00: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_00: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_00: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0001 d_00: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_01: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: FF:21:3D:E5:B4:CC (random) d_01: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_01: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_01: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_01: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_01: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0004 d_01: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_02: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_02: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_02: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_03: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_03: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_02: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_03: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_04: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_04: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_03: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_04: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_05: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) d_05: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: HW Variant: nRF52x (0x0002) d_05: @00:00:00.005768 [00:00:00.005,767] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99 d_04: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_02: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: D2:B4:48:CE:BC:8A (random) d_02: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_02: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_02: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_03: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: C4:C9:B5:FE:AB:8A (random) d_03: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_03: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_03: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_02: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_02: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0005 d_02: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_03: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_03: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0006 d_03: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_05: @00:00:00.005768 [00:00:00.005,767] bt_id: No static addresses stored in controller d_04: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: C6:9F:2E:65:E7:0C (random) d_04: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_04: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_04: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_04: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_04: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0007 d_04: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_05: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: Identity: E5:C2:AD:72:9E:F2 (random) d_05: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x0059 d_05: @00:00:00.006488 [00:00:00.006,469] bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff d_05: @00:00:00.006488 [00:00:00.006,469] mesh_test: Bluetooth initialized d_05: @00:00:00.006488 [00:00:00.006,469] mesh_test: Mesh initialized d_05: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: Primary Element: 0x0008 d_05: @00:00:00.006488 [00:00:00.006,469] bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000 d_01: @00:00:00.308898 [00:00:00.308,868] test_friendship: LPN: Polling 0x0001 (initial) d_00: @00:00:00.311379 [00:00:00.311,370] test_friendship: Friend: Poll from 0x0004 d_00: @00:00:00.311379 [00:00:00.311,370] test_friendship: Friend: established with 0x0004 d_02: @00:00:00.331743 [00:00:00.331,726] test_friendship: LPN: Polling 0x0001 (initial) d_00: @00:00:00.334581 [00:00:00.334,564] test_friendship: Friend: Poll from 0x0005 d_00: @00:00:00.334581 [00:00:00.334,564] test_friendship: Friend: established with 0x0005 d_03: @00:00:00.354204 [00:00:00.354,187] test_friendship: LPN: Polling 0x0001 (initial) d_00: @00:00:00.356698 [00:00:00.356,689] test_friendship: Friend: Poll from 0x0006 d_00: @00:00:00.356698 [00:00:00.356,689] test_friendship: Friend: established with 0x0006 d_05: @00:00:00.377048 [00:00:00.377,044] test_friendship: LPN: Polling 0x0001 (initial) d_00: @00:00:00.379212 [00:00:00.379,211] test_friendship: Friend: Poll from 0x0008 d_00: @00:00:00.379212 [00:00:00.379,211] test_friendship: Friend: established with 0x0008 d_01: @00:00:00.461606 [00:00:00.461,578] bt_mesh_lpn: Friendship established with 0x0001 d_01: @00:00:00.461606 [00:00:00.461,578] test_friendship: LPN: established with 0x0001 d_02: @00:00:00.485043 [00:00:00.485,015] bt_mesh_lpn: Friendship established with 0x0001 d_02: @00:00:00.485043 [00:00:00.485,015] test_friendship: LPN: established with 0x0001 d_03: @00:00:00.508834 [00:00:00.508,819] bt_mesh_lpn: Friendship established with 0x0001 d_03: @00:00:00.508834 [00:00:00.508,819] test_friendship: LPN: established with 0x0001 d_05: @00:00:00.530527 [00:00:00.530,517] bt_mesh_lpn: Friendship established with 0x0001 d_05: @00:00:00.530527 [00:00:00.530,517] test_friendship: LPN: established with 0x0001 d_04: @00:00:01.155274 [00:00:01.155,273] bt_mesh_lpn: No acceptable Friend Offers received d_04: @00:00:04.891602 ERROR: (WEST_TOPDIR/zephyr/tests/bluetooth/bsim_bt/bsim_test_mesh/src/test_friendship.c:393): evt_wait(LPN_ESTABLISHED, K_SECONDS(5)) failed with error -11 d_04: @00:00:04.891602 main: The TESTCASE FAILED with return code 2 d_00: @00:00:04.869426 main: The TESTCASE FAILED with return code 1 d_01: @00:00:00.715821 main: The TESTCASE FAILED with return code 1 d_03: @00:00:00.760743 main: The TESTCASE FAILED with return code 1 d_02: @00:00:00.738282 main: The TESTCASE FAILED with return code 1 d_05: @00:00:00.784180 main: The TESTCASE FAILED with return code 1 ```

Environment (please complete the following information):

aescolar commented 2 years ago

@alxelax can you please check in the Phy radio logs to see what is actually sent by each device and when each other device tries to receive and what it receives.

alxelax commented 2 years ago

@alxelax can you please check in the Phy radio logs to see what is actually sent by each device and when each other device tries to receive and what it receives.

Yep, sure. @aescolar could you provide more information how to collect these logs? Do I need to add some options when I build tests or to run it with specific options?

aescolar commented 2 years ago

@alxelax _The Phy can dump all radio activity to files. You control this with the -dump and -nodump command line options. This radio activity can be easily imported for analysis into the Ellisys Bluetooth Analyzer SW._ The results for each simulation end up in the results/ folder You don't need to import the simulation activity into the Ellisys, you can just look into the raw csv files if you want. (And the ellisys does not show you Rx activity , that you can only see in the corresponding csv files )

alxelax commented 2 years ago

logs.zip These are logs of tests. Archive includes all csv files for establish_multi.sh test. Additionally I put test log as test_log.txt file. It is possible to see in d_2G4_04.Tx.csv that 4th device sends some frames (this is broadcasted Friend Request). Also it is possible to see in d_2G4_00.Rx.csv that there are no these frames. Even more it is possible to see from test_log.txt and d_2G4_00.Rx.csv that bt_mesh_scan_cb returns what d_2G4_00.Rx.csv consists. As a consequence 0th device (friend node) doesn't hear 4th device (lpn's node Friend requests)

aescolar commented 2 years ago

Please check in the Rx file if the receiver is open at times when it could possibly receive those device 4 transmissions. Also please check that it is not receiving somebody else's packet at the time.

alxelax commented 2 years ago

Scanner in Mesh works from the BLE Mesh initialization with duty cycle about 100%. Rx should be open to the time of friendship establishment (according to timestamps from log files it is open). Regarding receiving somebody else's frame it seems very strange. Looking at Tx log and start stop time it is possible to see Friend Request takes 368 (time units) always. Looking at Rx log I see time diff (rx_time_stamp - start_time ) for receiving 368 tu length frame it spent 12645 tu !!! Then Rx spent 16662 tu fro receiving something that seems like nothing (or phy mistake ???) Rx spends again enormous long time for Rx 6070, 1487, 2369 tu. If compare with Tx time they are crossover-ed but why rx spends enormous time .

aescolar commented 2 years ago

Hi @alxelax , I took a look at the logs: The first 3 advertisement (transmissions) from device 4 overlap exactly in time and frequency with the first 3 advertisements from device 1. Device 0 happened to receive device 1 instead of 4(^1). The 4th advertisement from 4 happens while device 0 is listening in another frequency, in which it receives device 1 again. The 5th advertisement from 4 happens while device 0 is still receiving the previous packet from device 1. The 6th advertisement from 4 happens while device 0 is listening in another frequency, in which it receives device 5. Device 4 does not advertise anymore.

There does not seem to be any issues in the Phy. Device 4 just happens to try transmitting too few times given how busy device 0 is. This seems like a test case issue, not a simulation issues.


(^1) The default Phy modem is the magical one, which can receive any transmission irrespectively of received power level or interference, but will still only receive one packet at a time.

Note that in the Rx logs, the start_time is when the device opens its receiver listening for possible packets. It can be open for very long if so configured by SW. There is nothing wrong with it being open for several tenths of milliseconds. For a correctly received packet, the rx_time_stamp is the time in which the sync_word/phy_address of the locked received packet ends. For an SW stopped/aborted reception attempt, the rx_time_stamp is when the receiver was closed.

Time in the logs is all in microseconds.

alxelax commented 2 years ago

Hi @aescolar , thanks a lot for the detailed clarification. Now I see the cause of the issue. I'll modify the test to shift device starts.

aescolar commented 2 years ago

I'll modify the test to shift device starts.

Beware that if the test fails in this way it is probably too tough (meaning the test should probably allow for more time for devices to find each other or so). Note that in simulation tests are repeatable, but still you don't want to have tests that pass in simulation but in real life would be flaky.

Closing the issue.