Closed: ibirnbaum closed this issue 3 years ago.
Impressive analysis :thumbsup:
I've looked further into this issue, here's what I've found out in the meantime:
During the regular network stack initialization, there shouldn't be any risk of a race condition: the delayed work struct rs_timer is initialized via net_if_init -> iface_ipv6_init -> iface_ipv6_nd_init. The first time k_delayed_work_submit_to_queue is called for rs_timer should be via net_if_post_init -> net_if_up -> iface_ipv6_start -> net_if_start_rs.
Both net_if_init and net_if_post_init are called from within init_rx_queues in a suitable order; this whole sequence is triggered by net_init, which is registered via SYS_INIT to be executed in the POST_KERNEL phase.
Just browsing the code, it appears that net_if_start_rs isn't called directly from anywhere else. However, I found that it is also reached via the work_q thread, which extracts a job from the queue whose handler, carrier_on, calls net_if_up -> iface_ipv6_start -> net_if_start_rs.
Therefore, here's what causes the crash in chronological order:
Had the work_q thread not already called carrier_on, the timer ISR would not have interrupted the net_if_init / net_if_post_init sequence.
Therefore, this problem is most likely down to the network device driver. On actual hardware where an MDIO interface to the PHY is available and the link status shall be monitored, the link state is periodically polled by a low-priority auxiliary thread (as not all Zynq/UltraScale-based boards have the PHY's interrupt line actually connected). However, if PHY management is deferred to some other component of the system, or no PHY exists at all (as is the case with QEMU), a default link speed and the permanent availability of a physical link are assumed. In this case, net_eth_carrier_on is called from within the driver's iface_init function, just after ethernet_init(iface). I haven't yet tried to move the carrier handling in PHY-less mode to the start/stop functions of the driver, but based on my findings, I assume that the interface init function isn't the right context to indicate a carrier state change via the work queue?
Hi @ibirnbaum, sorry for late reply, just returned from vacation. Is this Ethernet driver somewhere where I could take a look at it?
The GMAC driver currently in the Zephyr tree does something similar: it sets net_if_flag_set(iface, NET_IF_NO_AUTO_START); so that the network interface is not brought up automatically. It then starts polling the PHY link status, and only once the cable is connected does it call net_eth_carrier_on(). There might be unexpected side effects in the net stack if the interface is marked UP while the actual interface is still down.
Hi Jukka, sorry for the late reply on my side as well. I'll have a look at how the GMAC driver handles this issue; until you pointed out this example, I had only come across net_eth_carrier_on/off. I think the problem should still be reproducible by reverting the fix I eventually came up with (moving the carrier_on/off calls to the start/stop functions of the driver). Have a look at my forked repository: https://github.com/weidmueller/zephyr.git, branch zynq_zynqmp. In this branch, the qemu_cortex_r5 target uses my driver for the Xilinx GEM controller. In drivers/ethernet/eth_xlnx_gem.c, within eth_xlnx_gem_iface_init, place a net_eth_carrier_on() call in the else-branch starting at line 2135, and remove the carrier-related calls from eth_xlnx_gem_start/stop_device. Building an application with the GEM driver and networking with IPv6 enabled should then reproduce the timing behaviour observed previously.
@ibirnbaum if you can trace it down to the approximate root cause (kernel vs timer driver vs IP stack) then we can assign this properly.
Hi all, sorry for this very lengthy propagation delay, I had been moved to other projects in the meantime...
From my point of view, this whole issue arose because I didn't know, prior to writing the driver, that declaring "carrier up" during the early init phase may already trigger the work queue thread before the system's initialization has completed, resulting in the race condition documented above. I'd therefore suggest adding a warning or a documented limitation to the carrier on/off functions.
Additionally, the cause of the crash would have been easier to detect if the work-queue-related functions such as k_work_submit_to_queue, which dereference the address pointed to by a delayed work struct's work_q pointer, contained an assertion that work_q != NULL. If such a modification were both feasible (depending on the number of work_q accesses in the code and the performance penalty of a corresponding number of assertions) and desirable, this would become more of a kernel issue than a driver or networking stack issue. As this was most likely a case of API misuse, I guess the networking stack is the component least to blame.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed, otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.
I'm currently working towards contributing an Ethernet device driver (Xilinx GEM) for both actual ARMv7 Cortex-A and Cortex-R hardware (Xilinx Zynq-7000, UltraScale RPU) and, for initial QA and unit testing, the Cortex-R5 QEMU target. While I've already had the driver up and running on actual Zynq hardware, QEMU execution kept freezing (without any indication of an exception) after the boot banner is displayed. This behaviour is reproducible on an actual Cortex-R5 UltraScale RPU, which I have used for investigation as I have a matching JTAG debugger available.
Based on the investigations using the hardware debugger, I think that there might be two different factors contributing to the system freezing. These might affect both the kernel and the network stack.
The relevant parts of the boot process are:
By the time the timer ISR, ttc_isr, is first executed, z_clock_announce is called with a parameter value of 101 - this matches the timeout that was set previously. It's here that things start to go wrong, as z_clock_announce is never left and gets stuck in an infinite loop.
The while-loop within z_clock_announce is entered as first() returns a non-NULL pointer and first()->dticks <= announce_remaining evaluates true:
The arithmetic operations, both on the stack and within the struct pointed to by t, look plausible; all is OK up to the point where the branch to remove_timeout(t) is taken:
remove_timeout appears plausible as well; next() is non-NULL and effectively 0 is added to next()->dticks. The branch to sys_dlist_remove is next:
Within sys_dlist_remove, the line "node->prev->next = ..." looks concerning as node->prev is NULL, but the disassembly shows that a guard was generated so the assignment is only performed when node->prev is non-NULL; the assignment is therefore skipped. From sys_dlist_remove, sys_dnode_init is called, which effectively clears the node:
This is the situation by the end of the execution of remove_timeout:
From there, execution continues in z_clock_announce, from where t->fn(t) (line 240) is called; this call effectively equals work_timeout(t). Within work_timeout, the CONTAINER_OF macro resolves the parent structure of t, a struct k_delayed_work, w. At this point, note that w->work_q is NULL. The other members of w are plausible (w->work.handler is a function pointer to rs_timeout, w->timeout.fn a function pointer to work_timeout). k_work_submit_to_queue is then called with its first parameter effectively being NULL:
For some reason, k_work_submit_to_queue is only displayed as the raw disassembly... Next to the disassembly, note the memory dump starting at address 0, which on an ARM platform is the location of the exception vectors:
By the time the end of k_work_submit_to_queue is reached, the memory at address 0 has been corrupted by k_queue_append due to the NULL function parameter:
The reset and undefined instruction exception vectors have been overwritten, which could lead to some serious malfunctions if the system were to ever leave the ISR execution. Yet, this doesn't happen (screenshot taken during a subsequent execution of the while-loop within z_clock_announce):
The loop is never left as first() always returns the same value for t and first()->dticks <= announce_remaining always resolves to 0 <= 0.
From this situation, I've investigated as to why w->work_q is NULL when k_work_submit_to_queue is called from within work_timeout. The memory location containing the work_q pointer is accessed three times during the boot process before the timer ISR is first executed:
When holding program execution for a while immediately before the work_q pointer is reverted to NULL, so that the next step jumps into the timer ISR (as the timeout has elapsed in the meantime), at least the first handling of the timeout works as expected: work_q is a valid pointer and the loop within z_clock_announce is left.
Is this some sort of race condition or a misconfiguration that can be resolved in some other way? Should k_work_submit_to_queue have a safeguard against a NULL pointer? Is there maybe a problem with the while-loop's exit condition within z_clock_announce as both components keep evaluating to true and first() always returns the same element?