zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.09k stars 6.2k forks source link

Console showing frequent usb warnings: <wrn> usb_device: Failed to write endpoint buffer 0x82 #20987

Closed sslupsky closed 3 years ago

sslupsky commented 4 years ago

Describe the bug Build the blinky application with usb console and shell. When there is shell activity, there are frequent console warnings:

[00:09:21.039,276] <wrn> usb_device: Failed to write endpoint buffer 0x82

I have also noticed occasional dropped characters in the output stream.

To Reproduce Steps to reproduce the behavior:

  1. west build -p auto -b adafruit_feather_m0_basic_proto
  2. boot board and connect terminal (I am using screen -L /dev/cu.usbmodem0_011)
  3. type "help" several times

The actual board is not the Adafruit board but my own custom board. I used the Adafruit board settings for my board.

Expected behavior No warnings.

Impact The console message is an annoyance. The dropped characters are a concern.

Screenshots or console output Here is a sample output of the warning message:

uart:~$ help
Please press the <Tab> button to see all available commands.
You can also use the <Tab> button to prompt or auto-complete all commands or its subcommands.
You can try to call commands with <-h> or <--help> parameter for more information.
Shell supports following meta-keys:
Ctrl+a, Ctrl+b, Ctrl+c, Ctrl+d, Ctrl+e, Ctrl+f, Ctrl+k, Ctrl+l, Ctrl+n, Ctrl+p, Ctrl+u, Ctrl+w
Alt+b, Alt+f.
Please refer to shell documentation for more details.
[00:09:21.039,276] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:21.039,672] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.018,493] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.026,367] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.026,580] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.027,313] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.027,526] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.999,572] <wrn> usb_device: Failed to write endpoint buffer 0x82
[00:09:22.999,969] <wrn> usb_device: Failed to write endpoint buffer 0x82

I also noticed in an earlier build of my blinky app that the output would occasionally drop characters. Here is a sample from that output (I added line to print the loop cnt value):

920Message 921Message 922Message 923Message 924Message 925Message 926Message 927Message 928Message 929Message 930Message 931Message 932Message 933Message 934M1791Mesage 1792Message 1793Message 1794Message 1795Message 1796Message 1797Message 1798Message 1799Message 1800Message 1801Message 1802Message 1803Message 1804Message 1805Message 1806Message 1807Mesage 1808Message 1809Message 1810Message 1811Message 181Message 1813Message 1814Message 1815Messae 1816Message 1817Message 1818Message 1819Message1820Message 1821Message 1822Message 1823Message 1824Message 1825Message 

If you look closely at the output, you will see it occasionally drops a character.

Environment (please complete the following information):

jfischer-no commented 4 years ago

@sslupsky Please provide information how to reproduce with an in-tree sample or a diff with your changes in samples/basic/blinky(?).

sslupsky commented 4 years ago

@jfischer-phytec-iot I believe you can use the samples/basic/blinky to reproduce the problem. I've added a low power state to my board and I forced the power state to active during the testing I described above, so I think it should not be affecting anything. Here is the diff for main.c:

11,21d10
< #include <sys/printk.h>
< #include <shell/shell.h>
< #include <shell/shell_uart.h>
< #include <version.h>
< #include <logging/log.h>
< #include <stdlib.h>
< 
< #include <power/power.h>
< #include <usb/usb_device.h>
< #include <soc.h>
< 
28,29d16
< LOG_MODULE_REGISTER(app);
< 
39,40d25
<     sys_pm_force_power_state(SYS_POWER_STATE_ACTIVE);
< 
48,63d32
< 
< static int cmd_sleep(const struct shell *shell, size_t argc, char **argv)
< {
<       ARG_UNUSED(argc);
<       ARG_UNUSED(argv);
< 
<     ADC->CTRLA.bit.ENABLE = 0;
<     usb_dc_detach();
<     sys_pm_force_power_state(SYS_POWER_STATE_AUTO);
< 
<       return 0;
< }
< 
< 
< //  Create root command "sleep"
< SHELL_CMD_ARG_REGISTER(sleep, NULL, "enable power management", cmd_sleep, 1, 0);

Here is my prj.conf:

# Drivers
CONFIG_ADC=y
CONFIG_ADC_0=y
CONFIG_I2C=y
CONFIG_FLASH=y

CONFIG_USB_DEVICE_STACK=y
CONFIG_USB_CDC_ACM=y
CONFIG_USB_DEVICE_PRODUCT="Scanimetrics WiTAP ISB"

# Shell
CONFIG_SHELL=y
CONFIG_ADC_SHELL=y
CONFIG_GPIO_SHELL=y
CONFIG_FLASH_SHELL=y
CONFIG_UART_SHELL_ON_DEV_NAME="CDC_ACM_0"

# Logging
# CONFIG_PRINTK=y
CONFIG_LOG=y

# CPP support
#CONFIG_CPLUSPLUS=y

and this is my board defconfig:

# SPDX-License-Identifier: Apache-2.0

CONFIG_ARM=y
CONFIG_SOC_FAMILY_SAM0=y
CONFIG_SOC_SERIES_SAMD21=y
CONFIG_SOC_PART_NUMBER_SAMD21G18A=y
CONFIG_BOARD_SCANIMETRICS_WITAP_ISB=y
CONFIG_USE_CODE_PARTITION=y
CONFIG_SOC_ATMEL_SAMD_XOSC32K=y
CONFIG_SOC_ATMEL_SAMD_XOSC32K_AS_MAIN=y
CONFIG_CORTEX_M_SYSTICK=n
CONFIG_SAM0_RTC_TIMER=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=32768
CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000
CONFIG_SYS_POWER_MANAGEMENT=y
CONFIG_SYS_POWER_SLEEP_STATES=y
CONFIG_SYS_PM_MIN_RESIDENCY_SLEEP_1=1
CONFIG_SYS_PM_MIN_RESIDENCY_SLEEP_2=10
CONFIG_CONSOLE=y
CONFIG_UART_CONSOLE=y
CONFIG_USB_UART_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_UART_SAM0=y
CONFIG_UART_INTERRUPT_DRIVEN=y
CONFIG_GPIO=y
CONFIG_GPIO_SAM0=y
CONFIG_WATCHDOG=y
CONFIG_WDT_SAM0=y
CONFIG_SPI=n
CONFIG_SPI_SAM0=n
CONFIG_PINMUX=y
CONFIG_PINMUX_SAM0=y
CONFIG_USB=y
CONFIG_USB_DC_SAM0=y
carlescufi commented 4 years ago

@nzmichaelh or @benpicco do you have any additional insight in this issue?

sslupsky commented 4 years ago

I looked into this issue further. It appears to be a general USB issue, not necessarily specific to the sam0. This issue manifests in usb_write() in usb_device.c. The function outputs a warning if it retries a write. I believe this should be a debug output instead of a warning. A check should be made after to determine if the write actually failed and if it does, a warning should be issued. Strictly speaking, this is likely unnecessary as the calling function will likely output an error.

Second, and likely more importantly, there is a race condition here. The numbers of retries is an arbitrary number and depending on the scheduler, the amount of time it takes to retry here is arbitrary. I found that making the message a debug (and turning off debug), reduced the time to the point that write failed quite often. Since the USB slot time is 1 ms, I recommend k_yield() be replaced with k_usleep(500L). This would delay the next retry for 1/2 the USB slot time.

existing usb_write():

int usb_write(u8_t ep, const u8_t *data, u32_t data_len, u32_t *bytes_ret)
{
    int tries = CONFIG_USB_NUMOF_EP_WRITE_RETRIES;
    int ret;

    do {
        ret = usb_dc_ep_write(ep, data, data_len, bytes_ret);
        if (ret == -EAGAIN) {
            LOG_WRN("Failed to write endpoint buffer 0x%02x", ep);
            k_yield();
        }

    } while (ret == -EAGAIN && tries--);

    return ret;
}

proposed change:

int usb_write(u8_t ep, const u8_t *data, u32_t data_len, u32_t *bytes_ret)
{
    int tries = CONFIG_USB_NUMOF_EP_WRITE_RETRIES;
    int ret;

    do {
        ret = usb_dc_ep_write(ep, data, data_len, bytes_ret);
        if (ret == -EAGAIN) {
            LOG_DBG("Failed to write endpoint buffer 0x%02x, %d retries remaining", ep, tries);
            k_usleep(500L);
        }

    } while (ret == -EAGAIN && tries--);

    if ( tries < 0 ) {
        LOG_WRN("Failed to write endpoint buffer 0x%02x", ep);
    }
    return ret;
}
jfischer-no commented 4 years ago

I looked into this issue further. It appears to be a general USB issue, not necessarily specific to the sam0. This issue manifests in usb_write() in usb_device.c. The function outputs a warning if it retries a write. I believe this should be a debug output instead of a warning.

No. It has to be a warning. If it happens for an endpoint other than control then something is fishy in the driver.

Second, and likely more importantly, there is a race condition here. The numbers of retries is an arbitrary number and depending on the scheduler, the amount of time it takes to retry here is arbitrary. I found that making the message a debug (and turning off debug), reduced the time to the point that write failed quite often. Since the USB slot time is 1 ms, I recommend k_yield() be replaced with k_usleep(500L). This would delay the next retry for 1/2 the USB slot time.

regardless, warning should not appear. The driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests.

AnanthJaisankar commented 4 years ago

Facing same issue: [00:00:38.034,088] usb_device: Failed to write endpoint buffer 0x82
[00:00:38.034,118] usb_device: Failed to write endpoint buffer 0x82
[00:00:38.034,118] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,149] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,179] usb_device: Transfer error -11 [00:00:38.034,240] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,515] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,545] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,576] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,606] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,759] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,790] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,820] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,851] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.034,881] usb_device: Transfer error -11 [00:00:38.035,614] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.035,644] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.035,675] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.035,705] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.035,705] usb_device: Transfer error -11 [00:00:38.036,102] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.036,132] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.036,163] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.036,193] usb_device: Failed to write endpoint buffer 0x82 [00:00:38.036,224] usb_device: Transfer error -11 and i'm using usb_cdc_acm on nrf52840 dongle

binkboyd commented 4 years ago

Using the below patch, based on https://github.com/zephyrproject-rtos/zephyr/issues/20250 we are no longer seeing the "Failed to write endpoint buffer" after transferring a few hundred megabytes. This is after getting it consistently along with data corruption previously. It should be noted that our use case mostly consists of single direction transfers from device to host, so high throuput from host to device has not been tested.

diff --git a/subsys/usb/class/cdc_acm.c b/subsys/usb/class/cdc_acm.c
index 3f7082374d..0a3bcf25a4 100644
--- a/subsys/usb/class/cdc_acm.c
+++ b/subsys/usb/class/cdc_acm.c
@@ -303,7 +303,7 @@ static void tx_work_handler(struct k_work *work)
        LOG_DBG("Got %d bytes from ringbuffer send to ep %x", len, ep);
-       usb_transfer(ep, dev_data->tx_buf, len, USB_TRANS_WRITE,
+       usb_transfer(ep, dev_data->tx_buf, len, USB_TRANS_WRITE | USB_TRANS_NO_ZLP,
                     cdc_acm_write_cb, dev_data);
 }
jfischer-no commented 4 years ago

@binkboyd @jrananth1032 please open a new issue for this bug in CDC ACM class. This issue is specific to usb_dc_sam0.c driver.

sslupsky commented 4 years ago

@jfischer-phytec-iot I think I mentioned that this issue is not constrained to the sam0. Moreover, @jrananth1032 indicated he observed the problem on nRF.

Also, you mentioned earlier

The driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests.

Are you implying that the sam0 driver does not pass the tests? What is failing?

Notwithstanding my comments above, I have found that the existing driver assumes that if the endpoint is busy, a write is in progress and it drops the frame. Why does the driver make that assumption? It seems to me that there should be a timeout waiting for the previous transfer to complete? The following patch to the usb_dc_sam0.c driver appears to dramatically reduce the frequency of the warning:

@@ -504,10 +504,13 @@ int usb_dc_ep_write(u8_t ep, const u8_t *buf, u32_t len, u32_t *ret_bytes)
                return -1;
        }

-       if (endpoint->EPSTATUS.bit.BK1RDY) {
-               /* Write in progress, drop */
-               return -EAGAIN;
-       }
+    s64_t reftime = k_uptime_get();
+    while (endpoint->EPSTATUS.bit.BK1RDY) {
+        // previous transfer is still not complete
+        if ((k_uptime_get() - reftime) >= 1) {
+            return -EAGAIN;
+        }
+    }

I have observed the warning but it is extremely rare now and occurs under dubious circumstances.

Perhaps the same issue exists in the nordic driver as well?

With regard to suppressing ZLP's, doesn't that violate the USB spec? As I recall, ZLP's are an essential part of the protocol?

sslupsky commented 4 years ago

With regard to the nordic driver, the underlying code that generates the warning is:

    /* USBD driver does not allow scheduling multiple DMA transfers
     * for one EP at a time. Next USB transfer on this endpoint can be
     * triggered after the completion of previous one.
     */
    if (ep_ctx->write_in_progress) {
        k_mutex_unlock(&ctx->drv_lock);
        return -EAGAIN;
    }

Similar to the sam0, if a write is on-going, the driver refuses to process the write, so called "drop" in the sam0 case.

It appears to me there is an underlying assumption that ep_write() is called only if the hardware is ready to receive it? If that is the case, could there be a situation where a ZLP (which is normal at the end of the transaction) is sent "early"?

sslupsky commented 4 years ago

@jfischer-phytec-iot @binkboyd There may be a race condition. The transfer status flag does not track the hardware state. Rather, it tracks the driver state. In most cases, this appears to be ok because of inherent delays using the kernel work queue. However, it appears to me that a usb_transfer() can overlap with a transfer that immediately follows at the hardware level. That is to say, usb_transfer() can queue a transfer before the hardware has finished sending the current packet. In this event, the transfer status is not busy, the hardware status is busy, the transfer is queued and therefore the warning is issued when usb_dc_ep_write() is executed.

Is the preferred solution to:

  1. insert a busy wait, or
  2. update usb_transfer_is_busy() to check the hardware status as well. This would require an additional function in usb_dc_xxx to return the hardware TX status.
jfischer-no commented 4 years ago

@sslupsky There may be a race condition, bug, bad code, or other pain in the stack, but again The sam0 driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests. Let me or someone else fix it first, even there is a issue in the stack it does not make any sense to test it with a buggy driver.

KubaFYI commented 4 years ago

Hi @sslupsky have you had any success in fixing this?

nzmichaelh commented 4 years ago

Hi folks. FYI, I'm working on on the SAM0 driver and will get it to pass testusb.

re: the driver returning -EAGAIN, the dw, stm32, nrfx, and sam0 drivers all use this as 'a transfer is currently going on, please try again'. usb_write and the log message seem to be incorrect, as the API doesn't require the call to block if there's an ongoing transmission - instead the caller is supposed to wait for the callback.

nzmichaelh commented 4 years ago

An update: so far I've fixed the buffer overflow and attach / detach memory leak. testusb now shows:

/dev/bus/usb/003/076 test 0,    0.000008 secs
/dev/bus/usb/003/076 test 1,    1.667441 secs
/dev/bus/usb/003/076 test 2,    1.935824 secs
/dev/bus/usb/003/076 test 3,    1.691187 secs
/dev/bus/usb/003/076 test 4,    1.960423 secs
/dev/bus/usb/003/076 test 5,   45.796623 secs
/dev/bus/usb/003/076 test 6,   51.611663 secs
/dev/bus/usb/003/076 test 7,   45.779666 secs
/dev/bus/usb/003/076 test 8,   51.613959 secs
/dev/bus/usb/003/076 test 9,    1.186737 secs
/dev/bus/usb/003/076 test 10 --> 32 (Broken pipe)
/dev/bus/usb/003/076 test 11,   16.278739 secs
/dev/bus/usb/003/076 test 12,   16.038821 secs
/dev/bus/usb/003/076 test 13 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 14 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 17,    1.719775 secs
/dev/bus/usb/003/076 test 18,    1.955908 secs
/dev/bus/usb/003/076 test 19,    1.684855 secs
/dev/bus/usb/003/076 test 20,    1.946822 secs
/dev/bus/usb/003/076 test 21 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 24,   46.007422 secs
/dev/bus/usb/003/076 test 27,   44.619108 secs
/dev/bus/usb/003/076 test 28,   51.224660 secs
/dev/bus/usb/003/076 test 29,    0.800093 secs

10,14, and 21 are all control endpoint related. 13 is a halt test. Getting closer...

jfischer-no commented 4 years ago

I am afraid it is not easy to fix the control endpoint issues. I spend time to investigate it. If I remember correctly, the controller is sometime to fast at the end of a data stage, accept new setup token and overwrites the buffer. You could try to use double buffering. Refactor all the drivers and stack to use zero buffer copy or minimize the number of copies would probably relax this issue too.

sslupsky commented 4 years ago

@jfischer-phytec-iot @nzmichaelh @KubaFYI I believe the wait on the BK1RDY flag I demonstrated above ensures the hardware buffer is ready before it gets overwritten.

In my case, I never encounter the failed to write endpoint buffer error since making this change.

nzmichaelh commented 4 years ago

testusb passes now. Tests 14 and 21 have a bug on the Linux side - testusb runs the test with length=1024 and vary=1024 which is invalid and hits this check in usbtest.c:ctrl_out:

    if (length < 1 || length > 0xffff || vary >= length)
        return -EINVAL;

Limiting the vary fixes this:

$ sudo ~/src/linux-source-5.4.0/tools/usb/testusb -v 512 -D /dev/bus/usb/003/084 
/home/michaelh/src/linux-source-5.4.0/tools/usb/testusb: /dev/bus/usb/003/084 may see only control tests
/dev/bus/usb/003/084 test 0,    0.000009 secs
/dev/bus/usb/003/084 test 1,    2.086418 secs
/dev/bus/usb/003/084 test 2,    2.001619 secs
/dev/bus/usb/003/084 test 3,    1.118426 secs
/dev/bus/usb/003/084 test 4,    1.025461 secs
/dev/bus/usb/003/084 test 5,   54.400337 secs
/dev/bus/usb/003/084 test 6,   57.145302 secs
/dev/bus/usb/003/084 test 7,   28.483353 secs
/dev/bus/usb/003/084 test 8,   29.611805 secs
/dev/bus/usb/003/084 test 9,    0.983739 secs
/dev/bus/usb/003/084 test 10,    4.006688 secs
/dev/bus/usb/003/084 test 11,   16.909420 secs
/dev/bus/usb/003/084 test 12,   16.823115 secs
/dev/bus/usb/003/084 test 13,    5.888045 secs
/dev/bus/usb/003/084 test 14,    0.951407 secs
/dev/bus/usb/003/084 test 17,    2.135497 secs
/dev/bus/usb/003/084 test 18,    2.021155 secs
/dev/bus/usb/003/084 test 19,    2.070156 secs
/dev/bus/usb/003/084 test 20,    2.023623 secs
/dev/bus/usb/003/084 test 21,    1.024373 secs
/dev/bus/usb/003/084 test 24,   60.867539 secs
/dev/bus/usb/003/084 test 27,   60.816679 secs
/dev/bus/usb/003/084 test 28,   56.888685 secs
/dev/bus/usb/003/084 test 29,    0.807269 secs
nzmichaelh commented 4 years ago

The fixes are on https://github.com/nzmichaelh/zephyr/tree/usb_dev. I've sent PRs for the first two.

nzmichaelh commented 4 years ago

@sslupsky re: spinning until free, I agree that that will fix the problem, but it's not clear from the API what the driver should do if the buffer is full and I'd rather not spin if this is called from an interrupt handler.

Details:

include/usb/usb_device.h doesn't say what will happen if the TX is full, or if it's safe to call from an interrupt handler. Different drivers do different things:

Any driver that yields is not interrupt safe, and in theory isn't compatible with samples/testusb which does a write from the status interrupt.

sslupsky commented 4 years ago

@nzmichaelh I think in the case of sam0 (and likely others too), the buffer is being updated before the data is actually transferred out. The samd21 datasheet teaches that the BK1RDY flag needs to be checked to ensure the buffer can be written. So, I put a 1ms timeout on the spin which corresponds to the 1ms usb frame time. I believe you are guaranteed that the buffer will be empty within the usb frame time unless there is a hardware failure or disconnect during the transmission of a frame. Nevertheless, I agree that spinning for 1ms in an ISR is a long period of time and should be avoided. So, how did you address this issue? I could not see from the link how you fixed this issue.

Perhaps @carlescufi can organize a discussion about the API issue to resolve the ambiguity so this issue can be properly resolved.

sslupsky commented 4 years ago

@nzmichaelh Thank you for the attention on this issue.

carlescufi commented 4 years ago

@nzmichaelh I think in the case of sam0 (and likely others too), the buffer is being updated before the data is actually transferred out. The samd21 datasheet teaches that the BK1RDY flag needs to be checked to ensure the buffer can be written. So, I put a 1ms timeout on the spin which corresponds to the 1ms usb frame time. I believe you are guaranteed that the buffer will be empty within the usb frame time unless there is a hardware failure or disconnect during the transmission of a frame. Nevertheless, I agree that spinning for 1ms in an ISR is a long period of time and should be avoided. So, how did you address this issue? I could not see from the link how you fixed this issue.

Perhaps @carlescufi can organize a discussion about the API issue to resolve the ambiguity so this issue can be properly resolved.

@sslupsky Yes, I'd be glad to. But first, could you create a separate issue describing the API ambiguity please?

sslupsky commented 4 years ago

@carlescufi ok. I can do that.

nzmichaelh commented 4 years ago

Note that the SAM0 USB fixes have landed. The SAM0 now passes usbtest.

petejohanson commented 4 years ago

I've just encountered this on a project I'm working on as well, my test hardware is nrf52840 based. I can confirm that replacing the k_yield() with the proposed k_usleep(500L) makes a huge difference in fixing this here.

carlescufi commented 3 years ago

@emob-nordic @jfischer-phytec-iot is this still an issue? Otherwise please close it.