zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

task_wdt can get stuck in a loop where hardware reset is never fired #40152

Closed ehughes closed 2 years ago

ehughes commented 2 years ago

Describe the bug

I have encountered a situation where the task_wdt subsystem gets stuck in a state where it is continuously feeding the hardware watchdog but no other threads are allowed to run. This makes the system appear to be "hung", and the watchdog will never actually reset the system.

Background

I have application software that can run on either an nrf52840 or an nrf9160. I noticed that after adding the task_wdt, the firmware would run for roughly 36 hours before getting into a state where it appeared to be locked up. To discover that the task_wdt subsystem was the culprit, I set up 14 units to run in parallel and ran different firmware in groups to isolate which of my changes caused the issue. For each test group, I recorded the power-on time and the time when the crash occurred (as measured by a cloud backend that the system has to check in with). The test units in a group would not always lock up simultaneously after 36 hours, but they would all eventually lock up, in discrete steps of 12 hours. For example, in a group of 5, 3 might hang at the 36-hour mark and the other two at 48 and 60 hours. I have run many iterations over the past several weeks and this behavior is consistent.

I have tracked down the error to a specific commit where I turned on the task_wdt. If I remove the task_wdt, the problem goes away.

My application has 3 threads, each with its own task_wdt channel. There is also logic that gets executed on the system work queue. I register a task_wdt channel for one of the functions in the system work queue with the assumption that any function that locks up in the queue will cause the watchdog to expire.

All four channels use a 5000 ms timeout. Important: CONFIG_TASK_WDT_MIN_TIMEOUT is also set to 5000 ms.
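
Roughly, the work queue channel is set up like the sketch below (placeholder names, not my actual application code):

#include <zephyr.h>
#include <task_wdt/task_wdt.h>

static struct k_work_delayable periodic_work;
static int work_wdt_id;

/* Feeds its own task_wdt channel and re-queues itself every 500 ms. If this
 * handler (or anything ahead of it on the system work queue) blocks, the
 * 5000 ms channel expires.
 */
static void periodic_work_handler(struct k_work *work)
{
    task_wdt_feed(work_wdt_id);
    k_work_reschedule(&periodic_work, K_MSEC(500));
}

void start_work_channel(void)
{
    /* NULL callback: a timeout triggers a system reset */
    work_wdt_id = task_wdt_add(5000U, NULL, NULL);
    k_work_init_delayable(&periodic_work, periodic_work_handler);
    k_work_reschedule(&periodic_work, K_MSEC(500));
}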

I am wondering if there is a race condition when task timeouts are exactly equal to CONFIG_TASK_WDT_MIN_TIMEOUT. I was able to attach a debugger (Segger Ozone + Segger J-Link) to units that were locked up. It looked like schedule_next_timeout was continuously getting called, but the scheduler was never allowed to switch to other threads, giving the impression of a system hang/lockup. The fact that I am doing a task_wdt_feed() in a function on the system work queue may also be a clue.

My hope was that the maintainer might have some insight into this issue. I am currently running additional test groups that have task_wdt channels with 5000 ms timeouts and CONFIG_TASK_WDT_MIN_TIMEOUT set to 1000 ms (a much smaller value). My hope is that this fixes the issue, but I think there may need to be some more documentation around the timeout values to avoid weird conditions where the system can never reset.

To Reproduce

This issue is very reproducible on my system; it just takes significant time. It may also occur with an NRF52840DK if we set up the same conditions (3 threads + functions on the system work queue), but there may be nuances in timing resulting from the logic in my tasks.

I only have 1 NRF52840DK so it might take a while to even see the issue. I think some LED blinks in a thread would be enough to see the issue, as all threads appear to stop but the system is still executing under some context that continuously calls schedule_next_timeout().

I would like to figure out a way to have this trigger quicker but I have so far been unable to reliably figure out how.

Expected behavior

The task_wdt should never allow the system to lock up. Worst case is that the system would reset.

Impact

Show stopper at this point. I need the unit to run 24/7 and reset if there are anomalies. I am considering just using the hardware watchdog on the nRF52840/nRF9160 with a different implementation.

martinjaeger commented 2 years ago

Thanks @ehughes for reporting the bug. This looks like a tricky one...

Two quick questions on my side:

  1. Did you have the hardware watchdog fallback enabled and the device still got stuck without being reset?
  2. Can you post the exact commit SHA of the Zephyr branch used for your tests?
ehughes commented 2 years ago

1.) CONFIG_TASK_WDT_HW_FALLBACK is set to y

2.) My west.yml references the nrf connect SDK:

f19ebe4 sdk-zephyr 2.6.99-ncs1

https://github.com/nrfconnect/sdk-zephyr/releases

I started two more tests yesterday afternoon. I should have some more data tomorrow. In my latest test groups I am experimenting with CONFIG_TASK_WDT_MIN_TIMEOUT in relation to the task timeouts. My best guess at this point is that they can never be exactly equal.

Also, CONFIG_TASK_WDT_MIN_TIMEOUT is currently left at its default value of 20.

martinjaeger commented 2 years ago

Also, CONFIG_TASK_WDT_MIN_TIMEOUT is currently left at its default value of 20.

Did you mean TASK_WDT_HW_FALLBACK_DELAY?

I'm wondering if for some reason a call to schedule_next_timeout from the k_timer callback (task_wdt_trigger function) could get interrupted by a call to the same function from task_wdt_feed. The system workqueue you mentioned runs in a thread with priority -1 or -2 by default, whereas the k_timer callback should run in the system timer interrupt context, so it should always be higher priority.

In order to validate this, it could still make sense to try wrapping this line between a k_sched_lock() and k_sched_unlock(). What do you think?
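
Something along these lines (just a sketch of the experiment; the internal schedule_next_timeout() from subsys/task_wdt/task_wdt.c is reduced to a stub here):

#include <zephyr.h>

/* Stub standing in for the internal helper in task_wdt.c. */
static void schedule_next_timeout(int64_t current_ticks)
{
    ARG_UNUSED(current_ticks);
    /* recompute the earliest expiring channel and re-arm the k_timer */
}

/* Sketch of the experiment: make sure the feed's rescheduling step cannot
 * be preempted between reading the tick count and re-arming the timer.
 */
void feed_locked(void)
{
    k_sched_lock();
    schedule_next_timeout(sys_clock_tick_get());
    k_sched_unlock();
}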

ehughes commented 2 years ago

Whoops, yes TASK_WDT_HW_FALLBACK_DELAY

I can try your suggestion. It will most likely not be until tomorrow evening when my current tests are concluded.

Thanks!

ehughes commented 2 years ago

@martinjaeger

I have some more test results. I tested 14 units, 4 of which were NRF9160 (group1) and 10 of which were NRF52840 (group2 and group3).

In this test I was trying to determine two things: whether CONFIG_TASK_WDT_MIN_TIMEOUT had anything to do with the issue I am seeing, and whether having a task_wdt channel on the system work queue matters.

group1 (4 NRF9160): I changed CONFIG_TASK_WDT_MIN_TIMEOUT to 1000 ms and the 4 task channels to 5000 ms.
group2 (5 NRF52840): Same conditions as group1.
group3 (5 NRF52840): CONFIG_TASK_WDT_MIN_TIMEOUT is 5000 ms and the task channel timeouts are 5000 ms. I removed the task channel on the system work queue, leaving 3 task channels in regular threads.

Once again I had failures at 36 hours:

group1: 3/4 failed
group2: 1/5 failed
group3: 5/5 failed

I believe that if I wait a bit longer (to the next multiple of 12 hours), I will see more units drop out.

I managed to find some NRF52840 DK units. I will be attempting to make a very simple test application for your review that sets up similar conditions (3 task channels on threads, 1 on the sys work queue).

Once I get this set up, I'll see if I can recreate the issue there; that might be easier for you to review.

I do want to try your k_sched_lock() / k_sched_unlock() suggestion. Once I get the DKs set up, I'll see about testing that change.

martinjaeger commented 2 years ago

Thanks a lot for the update. So it looks like adjusting the CONFIG_TASK_WDT_MIN_TIMEOUT doesn't really have any effect.

I managed to find some NRF52840 DK units. I will be attempting to make a very simple test application for your review that sets up similar conditions (3 task channels on threads, 1 on the sys work queue).

That would be great. I've got one NRF52840 DK and one NRF5340 DK here.

ehughes commented 2 years ago

@martinjaeger

I was able to replicate the issue on NRF52840DKs and an NRF9160 using a simpler program that removes all of my application logic but retains the number of threads, etc.

Setup:

I modified the existing task_wdt sample with the attached main.c and prj.conf.

There are 3 threads, each with a task watchdog channel, and 1 function on the sys work queue. CONFIG_TASK_WDT_MIN_TIMEOUT is set to 5000 ms and each of the task_wdt channel timeouts is 5000 ms. Each thread toggles an LED every 500 ms.

I used the standard west build / west flash to put the sample program on the dev boards.

I ran the test on two NRF52840 DKs and one NRF9160 DK.

Results:

After 36 hours, each one of the units was in the locked-up state. All of the LEDs stopped blinking on all of the test units. The 36-hour number is approximate as I had to record the time at start and then periodically check; I am within +/- 1 hour.

Code Modifications to the task_wdt sample

prj.conf


CONFIG_LOG=y

CONFIG_WATCHDOG=y
CONFIG_WDT_LOG_LEVEL_DBG=y
CONFIG_WDT_DISABLE_AT_BOOT=y

CONFIG_TASK_WDT=y
CONFIG_TASK_WDT_MIN_TIMEOUT=5000

CONFIG_THREAD_NAME=y

main.c

/*
 * Copyright (c) 2020 Libre Solar Technologies GmbH
 *
 * SPDX-License-Identifier: Apache-2.0
 */

#include <zephyr.h>
#include <device.h>
#include <drivers/watchdog.h>
#include <sys/reboot.h>
#include <task_wdt/task_wdt.h>
#include <sys/printk.h>
#include <stdbool.h>
#include <drivers/gpio.h>

static struct k_work_delayable bridge_work;
void bridge_work_handler(struct k_work *work);
int bridge_thread_wdt_id;

/* 1000 msec = 1 sec */
#define SLEEP_TIME_MS   1000

/* The devicetree node identifier for the "led0" alias. */
#define LED0_NODE DT_ALIAS(led0)
#define LED1_NODE DT_ALIAS(led1)
#define LED2_NODE DT_ALIAS(led2)
#define LED3_NODE DT_ALIAS(led3)

#if DT_NODE_HAS_STATUS(LED0_NODE, okay)
#define LED0    DT_GPIO_LABEL(LED0_NODE, gpios)
#define PIN0    DT_GPIO_PIN(LED0_NODE, gpios)
#define FLAGS0  DT_GPIO_FLAGS(LED0_NODE, gpios)
#else
/* A build error here means your board isn't set up to blink an LED. */
#error "Unsupported board: led0 devicetree alias is not defined"
#define LED0    ""
#define PIN0    0
#define FLAGS0  0
#endif

#if DT_NODE_HAS_STATUS(LED1_NODE, okay)
#define LED1    DT_GPIO_LABEL(LED1_NODE, gpios)
#define PIN1    DT_GPIO_PIN(LED1_NODE, gpios)
#define FLAGS1  DT_GPIO_FLAGS(LED1_NODE, gpios)
#else
/* A build error here means your board isn't set up to blink an LED. */
#error "Unsupported board: led1 devicetree alias is not defined"
#define LED1    ""
#define PIN1    0
#define FLAGS1  0
#endif

#if DT_NODE_HAS_STATUS(LED2_NODE, okay)
#define LED2    DT_GPIO_LABEL(LED2_NODE, gpios)
#define PIN2    DT_GPIO_PIN(LED2_NODE, gpios)
#define FLAGS2  DT_GPIO_FLAGS(LED2_NODE, gpios)
#else
/* A build error here means your board isn't set up to blink an LED. */
#error "Unsupported board: led2 devicetree alias is not defined"
#define LED2    ""
#define PIN2    0
#define FLAGS2  0
#endif

#if DT_NODE_HAS_STATUS(LED3_NODE, okay)
#define LED3    DT_GPIO_LABEL(LED3_NODE, gpios)
#define PIN3    DT_GPIO_PIN(LED3_NODE, gpios)
#define FLAGS3  DT_GPIO_FLAGS(LED3_NODE, gpios)
#else
/* A build error here means your board isn't set up to blink an LED. */
#error "Unsupported board: led3 devicetree alias is not defined"
#define LED3    ""
#define PIN3    0
#define FLAGS3  0
#endif

/*
 * To use this sample, either the devicetree's /aliases must have a
 * 'watchdog0' property, or one of the following watchdog compatibles
 * must have an enabled node.
 *
 * If the devicetree has a watchdog node, we get the watchdog device
 * from there. Otherwise, the task watchdog will be used without a
 * hardware watchdog fallback.
 */
#if DT_NODE_HAS_STATUS(DT_ALIAS(watchdog0), okay)
#define WDT_NODE DT_ALIAS(watchdog0)
#elif DT_HAS_COMPAT_STATUS_OKAY(st_stm32_window_watchdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(st_stm32_window_watchdog)
#elif DT_HAS_COMPAT_STATUS_OKAY(st_stm32_watchdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(st_stm32_watchdog)
#elif DT_HAS_COMPAT_STATUS_OKAY(nordic_nrf_watchdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(nordic_nrf_watchdog)
#elif DT_HAS_COMPAT_STATUS_OKAY(espressif_esp32_watchdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(espressif_esp32_watchdog)
#elif DT_HAS_COMPAT_STATUS_OKAY(silabs_gecko_wdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(silabs_gecko_wdog)
#elif DT_HAS_COMPAT_STATUS_OKAY(nxp_kinetis_wdog32)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(nxp_kinetis_wdog32)
#elif DT_HAS_COMPAT_STATUS_OKAY(microchip_xec_watchdog)
#define WDT_NODE DT_COMPAT_GET_ANY_STATUS_OKAY(microchip_xec_watchdog)
#endif

void main(void)
{
    const struct device *dev;
    bool led_is_on = true;
    int ret;

    dev = device_get_binding(LED3);
    if (dev == NULL) {
        return;
    }

    ret = gpio_pin_configure(dev, PIN3, GPIO_OUTPUT_ACTIVE | FLAGS3);
    if (ret < 0) {
        return;
    }

#ifdef WDT_NODE
    const struct device *hw_wdt_dev = DEVICE_DT_GET(WDT_NODE);
#else
    const struct device *hw_wdt_dev = NULL;
#endif

    printk("Task watchdog sample application.\n");

    if (!device_is_ready(hw_wdt_dev)) {
        printk("Hardware watchdog %s is not ready; ignoring it.\n",
               hw_wdt_dev->name);
        hw_wdt_dev = NULL;
    }

    task_wdt_init(hw_wdt_dev);

    k_work_init_delayable(&bridge_work, bridge_work_handler);

    /* Register a task_wdt channel for the work item before scheduling it,
     * so bridge_work_handler() feeds its own channel rather than channel 0.
     */
    bridge_thread_wdt_id = task_wdt_add(5000U, NULL, NULL);

    k_work_reschedule(&bridge_work, K_MSEC(500));

    /* passing NULL instead of callback to trigger system reset */
    int task_wdt_id = task_wdt_add(5000U, NULL, NULL);

    while (true)
    {
        gpio_pin_set(dev, PIN3, (int)led_is_on);
        led_is_on = !led_is_on;
        task_wdt_feed(task_wdt_id);
        k_msleep(500);
    }
}

void control_thread0(void)
{
    int task_wdt_id;
    const struct device *dev;
    bool led_is_on = true;
    int ret;

    dev = device_get_binding(LED0);
    if (dev == NULL) {
        return;
    }

    ret = gpio_pin_configure(dev, PIN0, GPIO_OUTPUT_ACTIVE | FLAGS0);
    if (ret < 0) {
        return;
    }   
    printk("Control thread started.\n");

    /*
     * Add a new task watchdog channel with a NULL callback (so a timeout
     * triggers a system reset) and the current thread ID as user data.
     */
    task_wdt_id = task_wdt_add(5000U, NULL,
                               (void *)k_current_get());

    while (true)
    {
        gpio_pin_set(dev, PIN0, (int)led_is_on);
        led_is_on = !led_is_on;
        task_wdt_feed(task_wdt_id);
        k_msleep(500);
    }
}

void control_thread1(void)
{
    int task_wdt_id;
    const struct device *dev;
    bool led_is_on = true;
    int ret;

    dev = device_get_binding(LED1);
    if (dev == NULL) {
        return;
    }

    ret = gpio_pin_configure(dev, PIN1, GPIO_OUTPUT_ACTIVE | FLAGS1);
    if (ret < 0) {
        return;
    }   
    printk("Control thread started.\n");

    /*
     * Add a new task watchdog channel with a NULL callback (so a timeout
     * triggers a system reset) and the current thread ID as user data.
     */
    task_wdt_id = task_wdt_add(5000U, NULL,
                               (void *)k_current_get());

    while (true)
    {
        gpio_pin_set(dev, PIN1, (int)led_is_on);
        led_is_on = !led_is_on;
        task_wdt_feed(task_wdt_id);
        k_msleep(500);
    }
}

void control_thread2(void)
{
    int task_wdt_id;
    const struct device *dev;
    bool led_is_on = true;
    int ret;

    dev = device_get_binding(LED2);
    if (dev == NULL) {
        return;
    }

    ret = gpio_pin_configure(dev, PIN2, GPIO_OUTPUT_ACTIVE | FLAGS2);
    if (ret < 0) {
        return;
    }   
    printk("Control thread started.\n");

    /*
     * Add a new task watchdog channel with a NULL callback (so a timeout
     * triggers a system reset) and the current thread ID as user data.
     */
    task_wdt_id = task_wdt_add(5000U, NULL,
                               (void *)k_current_get());

    while (true)
    {
        gpio_pin_set(dev, PIN2, (int)led_is_on);
        led_is_on = !led_is_on;
        task_wdt_feed(task_wdt_id);
        k_msleep(500);
    }
}

K_THREAD_DEFINE(control0, 1024, control_thread0, NULL, NULL, NULL, -1, 0, 1000);
K_THREAD_DEFINE(control1, 1024, control_thread1, NULL, NULL, NULL, -1, 0, 1000);
K_THREAD_DEFINE(control2, 1024, control_thread2, NULL, NULL, NULL, -1, 0, 1000);

/* Runs on the system work queue: feeds its own task_wdt channel and
 * re-queues itself every 500 ms.
 */
void bridge_work_handler(struct k_work *work)
{
    task_wdt_feed(bridge_thread_wdt_id);

    k_work_reschedule(&bridge_work, K_MSEC(500));
}
alexanderendian commented 2 years ago

I was intrigued by the issue title and decided to see if I could spot the error.

I noticed that schedule_next_timeout takes current_ticks as a uint32_t when I think it should be an int64_t. It's worth pointing out that 2^32 / 32768 Hz ≈ 36.4 hours.

If my reading of the code is correct, once the tick count overflows 32 bits, schedule_next_timeout will schedule the trigger timer to a time in the past, causing (presumably) the timer to fire immediately, which will call schedule_next_timeout again. That would explain why the threads lock up but the hardware watchdog is still being fed.
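
A rough illustration of the truncation I have in mind (assuming the 32.768 kHz RTC drives the kernel ticks on these SoCs):

#include <zephyr.h>

void show_wrap(void)
{
    int64_t now = sys_clock_tick_get();   /* full 64-bit kernel tick count */
    uint32_t as_u32 = (uint32_t)now;      /* what a uint32_t parameter would receive */

    /* After 2^32 ticks (2^32 / 32768 Hz, roughly 36.4 hours) the truncated
     * value lags the real count, so a "next timeout" computed from it lies
     * in the past and the trigger timer fires again immediately.
     */
    if ((int64_t)as_u32 != now) {
        printk("tick count no longer fits in 32 bits: %lld vs %u\n", now, as_u32);
    }
}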

I have no idea why this doesn't happen 100% of the time after 36 hours, or why it happens at 12 hour increments thereafter.

ehughes commented 2 years ago

@alexanderendian

That is an interesting discovery. I think it would be easy to make the issue happen more quickly by having the tick count overflow at some smaller bit depth. I do think you may be on to something.

To work around the issue, I came up with an alternate implementation where each channel has a fixed timeout equal to the hardware watchdog timeout. In this implementation, each channel simply sets a flag to indicate it has checked in. The thread that feeds the watchdog checks that all of these flags are set before it will issue the feed. It removes all integer arithmetic and any possibility that anything can underflow/overflow. In my current test case, I have had 14 units running for 96 hours without issue.
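
Roughly sketched below (the names are illustrative, not my actual code):

#include <zephyr.h>
#include <drivers/watchdog.h>

#define NUM_CHANNELS 4

static atomic_t checkins;    /* one bit per monitored context */

/* Each thread / work item calls this instead of task_wdt_feed(). */
void channel_checkin(int channel)
{
    atomic_or(&checkins, BIT(channel));
}

/* Single feeder: the hardware watchdog is only fed if every context has
 * checked in since the last pass. No per-channel timers, no tick arithmetic.
 */
void feeder_loop(const struct device *wdt_dev, int wdt_channel_id)
{
    while (true) {
        if (atomic_clear(&checkins) == BIT_MASK(NUM_CHANNELS)) {
            wdt_feed(wdt_dev, wdt_channel_id);
        }
        k_msleep(1000);      /* poll well inside the hardware timeout */
    }
}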

henrikbrixandersen commented 2 years ago

To work around the issue, I came up with an alternate implementation where each channel has a fixed timeout equal to the hardware watchdog timeout. In this implementation, each channel simply sets a flag to indicate it has checked in. The thread that feeds the watchdog checks that all of these flags are set before it will issue the feed. It removes all integer arithmetic and any possibility that anything can underflow/overflow. In my current test case, I have had 14 units running for 96 hours without issue.

Are you able to post your modifications as a PR for review?

henrikbrixandersen commented 2 years ago

Bumping to medium severity since this is causing lock-ups in what should be (task) watchdog protected code.

ehughes commented 2 years ago

@henrikbrixandersen

Just as a clarification, my alternate implementation is a completely separate set of functions/files that implement task_wdt-like behavior but with features removed (no separate timeouts for each thread). This was so I could test a few things in parallel. My implementation would break the current API.

@alexanderendian's suggestion is a good one. I think I may be able to accelerate the time it takes to show the bug. If that is the case, then I think the 64-bit timeout suggestion may be the fix (but it will take several days to verify).

I'll be able to test some more tomorrow once I can get to my test hardware (I am UTC-5:00).

martinjaeger commented 2 years ago

I noticed that schedule_next_timeout takes current_ticks as a uint32_t when I think it should be an int64_t. It's worth pointing out that 2^32 / 32768 Hz ≈ 36.4 hours.

If my reading of the code is correct, once the tick count overflows 32 bits, schedule_next_timeout will schedule the trigger timer to a time in the past, causing (presumably) the timer to fire immediately, which will call schedule_next_timeout again. That would explain why the threads lock up but the hardware watchdog is still being fed.

This does indeed look like the root cause. I was also thinking of a potential overflow, but didn't realize that the nRF SysTick runs at a completely different speed than the STM32 I'm usually working with, so my calculations just wouldn't match anything close to 36 hours.

It's still strange though, that it does not get stuck on all devices at the same time...

Created a draft PR (linked above) which I'm currently testing with the sample code provided by @ehughes (thanks!).
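
Conceptually the change just keeps the tick count in 64 bits all the way into the scheduling helper (sketch of the direction only; the actual diff is in the linked PR):

#include <zephyr.h>

/* Before (suspected bug): the 64-bit kernel tick count was truncated to
 * uint32_t. After: carry the full 64-bit tick count.
 */
static void schedule_next_timeout(int64_t current_ticks)
{
    ARG_UNUSED(current_ticks);
    /* unchanged scheduling logic, now operating on 64-bit ticks */
}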

martinjaeger commented 2 years ago

After more than 48 hours my nRF52840 DK is still happily blinking with the fix applied as suggested by @alexanderendian. Any further test results from you @ehughes?

ehughes commented 2 years ago

@martinjaeger No further (at least useful) test results on my side. I tried artificially limiting the ticker to something smaller than 32 bits (say 20 bits) to trigger the issue faster, but I got different behavior.

I can retry my test units with the PR and let you know.

martinjaeger commented 2 years ago

I can retry my test units with the PR and let you know.

That would be good. If none of your multiple test units got stuck after 2 days I think we can consider the bug fixed.

ehughes commented 2 years ago

@martinjaeger

I have been running two NRF52840 DKs and one NRF9160 DK for about 60 hours without issue. I think we got it.

martinjaeger commented 2 years ago

Yep, it didn't get stuck here either after about 1 week. So I'll convert the PR from draft to a "real" PR.