Closed: ibirnbaum closed this issue 3 years ago.
Impressive analysis :thumbsup:
I've looked further into this issue, here's what I've found out in the meantime:
During the regular network stack initialization, there shouldn't be any risk of a race condition: the delayed work struct rs_timer is initialized via net_if_init -> iface_ipv6_init -> iface_ipv6_nd_init. The first time k_delayed_work_submit_to_queue is called for rs_timer should be via net_if_post_init -> net_if_up -> iface_ipv6_start -> net_if_start_rs.
Both net_if_init and net_if_post_init are called from within init_rx_queues in a suitable order; this whole sequence is triggered by net_init, which is registered via SYS_INIT to be executed in the POST_KERNEL phase.
Just browsing the code, it appears that net_if_start_rs isn't called directly from anywhere else. However, I found that it is also reached via the work_q thread, which extracts a job from the queue whose handler, carrier_on, calls net_if_up -> iface_ipv6_start -> net_if_start_rs.
Therefore, here's what causes the crash in chronological order:
Had the work_q thread not already called carrier_on, the timer ISR would not have interrupted the net_if_init / net_if_post_init sequence.
Therefore, this problem is most likely down to the network device driver. On actual hardware where an MDIO interface to the PHY is available and the link status shall be monitored, the link state is periodically polled by a low-priority auxiliary thread (as not all Zynq/UltraScale-based boards have the PHY's interrupt line actually connected). However, if PHY management is deferred to some other component of the system, or no PHY exists at all (as is the case with QEMU), a default link speed and the permanent availability of a physical link are assumed. In this case, net_eth_carrier_on is called from within the driver's iface_init function, just after ethernet_init(iface). I haven't yet tried to move the carrier handling in PHY-less mode to the start/stop functions of the driver, but based on my findings, I assume that the interface init function isn't the right context to indicate a carrier state change via the work queue?
Hi @ibirnbaum, sorry for late reply, just returned from vacation. Is this Ethernet driver somewhere where I could take a look at it?
The GMAC driver currently in the Zephyr tree does something similar: it sets net_if_flag_set(iface, NET_IF_NO_AUTO_START); so that the network interface is not brought up automatically. It then starts polling the PHY link status, and only once the cable is connected does it call net_eth_carrier_on(). There might be unexpected side effects in the net stack if the interface is marked UP while the actual interface is still down.
Hi Jukka, sorry for the late reply on my side as well. I'll have a look at how the GMAC driver handles this issue; until you pointed out this example, I had only come across net_eth_carrier_on/off. I think the problem should still be reproducible by reverting the fix I eventually came up with (moving the carrier_on/off calls to the start/stop functions of the driver). Have a look at my forked repository: https://github.com/weidmueller/zephyr.git, branch zynq_zynqmp. In this branch, the qemu_cortex_r5 target uses my driver for the Xilinx GEM controller. In drivers/ethernet/eth_xlnx_gem.c, within eth_xlnx_gem_iface_init, place a net_eth_carrier_on() call in the else-branch starting at line 2135, and remove the carrier-related calls from eth_xlnx_gem_start/stop_device. Building an application with the GEM driver and networking with IPv6 enabled should then reproduce the timing behaviour observed previously.
@ibirnbaum if you can trace it down to the approximate root cause (kernel vs timer driver vs IP stack) then we can assign this properly.
Hi all, sorry for this very lengthy propagation delay, I had been moved to other projects in the meantime...
From my point of view, this whole issue arose because I didn't know, prior to writing the driver, that declaring "carrier up" during the early init phase may already trigger the work queue thread before the system's initialization has completed, resulting in the race condition documented above. I'd therefore suggest adding a warning or a documented limitation to the carrier on/off functions.
Additionally, the cause of the crash would have been easier to detect if the work-queue-related functions such as k_work_submit_to_queue, which dereference the address pointed to by a delayed work struct's work_q pointer, contained an assertion that work_q != NULL. If such a modification were both feasible (depending on the number of work_q accesses in the code and the performance penalty of a corresponding number of assertions) and desirable, this would become more of a kernel issue than a driver or networking stack issue. As this was most likely a case of API misuse, I guess the networking stack is the component least to blame.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed, otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.
I'm currently working towards contributing an Ethernet device driver (Xilinx GEM) for both actual ARMv7 Cortex-A and Cortex-R hardware (Xilinx Zynq-7000, UltraScale RPU) and, for initial QA and unit testing, the Cortex-R5 QEMU target. While I've already had the driver up and running on actual Zynq hardware, QEMU execution kept freezing (without any indication of an exception) after the boot banner is displayed. This behaviour is reproducible on an actual Cortex-R5 UltraScale RPU, which I have used for investigation as I have a matching JTAG debugger available.
Based on the investigations using the hardware debugger, I think that there might be two different factors contributing to the system freezing. These might affect both the kernel and the network stack.
The relevant parts of the boot process are:
By the time the timer ISR, ttc_isr, is first executed, z_clock_announce is called with a parameter value of 101 - this matches the timeout that was set previously. It's here that things start to go wrong, as z_clock_announce is never left and gets stuck in an infinite loop.
The while-loop within z_clock_announce is entered as first() returns a non-NULL pointer and first()->dticks <= announce_remaining evaluates true:
The arithmetic operations, both on the stack and within the struct pointed to by t, look plausible; all is OK up to the point where the branch to remove_timeout(t) is taken:
remove_timeout appears plausible as well; next() is non-NULL and effectively 0 is added to next()->dticks. The branch to sys_dlist_remove is next:
Within sys_dlist_remove, the line "node->prev->next = ..." looks concerning as node->prev is NULL, but the disassembly shows that a guard was generated so the assignment is only performed when node->prev is non-NULL; the assignment is therefore skipped. From sys_dlist_remove, sys_dnode_init is called, which effectively clears the node:
This is the situation by the end of the execution of remove_timeout:
From there, execution continues in z_clock_announce, from where t->fn(t) (line 240) is called; this call effectively equals work_timeout(t). Within work_timeout, the CONTAINER_OF macro resolves the parent structure of t, a struct k_delayed_work, w. At this point, note that w->work_q is NULL. The other members of w are plausible (w->work.handler is a function pointer to rs_timeout, w->timeout.fn a function pointer to work_timeout). k_work_submit_to_queue is then called with its first parameter effectively being NULL:
For some reason, k_work_submit_to_queue is only displayed as the raw disassembly... Next to the disassembly, note the memory dump starting at address 0, which on an ARM platform is the location of the exception vectors:
By the time the end of k_work_submit_to_queue is reached, the memory at address 0 has been corrupted by k_queue_append due to the NULL function parameter:
The reset and undefined instruction exception vectors have been overwritten, which could lead to some serious malfunctions if the system were to ever leave the ISR execution. Yet, this doesn't happen (screenshot taken during a subsequent execution of the while-loop within z_clock_announce):
The loop is never left as first() always returns the same value for t and first()->dticks <= announce_remaining always resolves to 0 <= 0.
From this situation, I've investigated as to why w->work_q is NULL when k_work_submit_to_queue is called from within work_timeout. The memory location containing the work_q pointer is accessed three times during the boot process before the timer ISR is first executed:
When holding program execution for a while immediately before the work_q pointer is reverted to NULL, so that the next step jumps into the timer ISR (as the timeout has elapsed in the meantime), at least the first handling of the timeout works as expected: work_q is a valid pointer and the loop within z_clock_announce is left.
Is this some sort of race condition or a misconfiguration that can be resolved in some other way? Should k_work_submit_to_queue have a safeguard against a NULL pointer? Is there maybe a problem with the while-loop's exit condition within z_clock_announce as both components keep evaluating to true and first() always returns the same element?