Closed sslupsky closed 4 years ago
@sslupsky Please provide information how to reproduce with an in-tree sample or a diff with your changes in samples/basic/blinky
(?).
@jfischer-phytec-iot I believe you can use the samples/basic/blinky to reproduce the problem. I've added a low power state to my board and I forced the power state to active during the testing I described above, so I think it should not be affecting anything. Here is the diff for main.c:
11,21d10
< #include <sys/printk.h>
< #include <shell/shell.h>
< #include <shell/shell_uart.h>
< #include <version.h>
< #include <logging/log.h>
< #include <stdlib.h>
<
< #include <power/power.h>
< #include <usb/usb_device.h>
< #include <soc.h>
<
28,29d16
< LOG_MODULE_REGISTER(app);
<
39,40d25
< sys_pm_force_power_state(SYS_POWER_STATE_ACTIVE);
<
48,63d32
<
< static int cmd_sleep(const struct shell *shell, size_t argc, char **argv)
< {
< ARG_UNUSED(argc);
< ARG_UNUSED(argv);
<
< ADC->CTRLA.bit.ENABLE = 0;
< usb_dc_detach();
< sys_pm_force_power_state(SYS_POWER_STATE_AUTO);
<
< return 0;
< }
<
<
< // Create root command "sleep"
< SHELL_CMD_ARG_REGISTER(sleep, NULL, "enable power management", cmd_sleep, 1, 0);
Here is my prj.conf:
# Drivers
CONFIG_ADC=y
CONFIG_ADC_0=y
CONFIG_I2C=y
CONFIG_FLASH=y
CONFIG_USB_DEVICE_STACK=y
CONFIG_USB_CDC_ACM=y
CONFIG_USB_DEVICE_PRODUCT="Scanimetrics WiTAP ISB"
# Shell
CONFIG_SHELL=y
CONFIG_ADC_SHELL=y
CONFIG_GPIO_SHELL=y
CONFIG_FLASH_SHELL=y
CONFIG_UART_SHELL_ON_DEV_NAME="CDC_ACM_0"
# Logging
# CONFIG_PRINTK=y
CONFIG_LOG=y
# CPP support
#CONFIG_CPLUSPLUS=y
and this is my board defconfig:
# SPDX-License-Identifier: Apache-2.0
CONFIG_ARM=y
CONFIG_SOC_FAMILY_SAM0=y
CONFIG_SOC_SERIES_SAMD21=y
CONFIG_SOC_PART_NUMBER_SAMD21G18A=y
CONFIG_BOARD_SCANIMETRICS_WITAP_ISB=y
CONFIG_USE_CODE_PARTITION=y
CONFIG_SOC_ATMEL_SAMD_XOSC32K=y
CONFIG_SOC_ATMEL_SAMD_XOSC32K_AS_MAIN=y
CONFIG_CORTEX_M_SYSTICK=n
CONFIG_SAM0_RTC_TIMER=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=32768
CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000
CONFIG_SYS_POWER_MANAGEMENT=y
CONFIG_SYS_POWER_SLEEP_STATES=y
CONFIG_SYS_PM_MIN_RESIDENCY_SLEEP_1=1
CONFIG_SYS_PM_MIN_RESIDENCY_SLEEP_2=10
CONFIG_CONSOLE=y
CONFIG_UART_CONSOLE=y
CONFIG_USB_UART_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_UART_SAM0=y
CONFIG_UART_INTERRUPT_DRIVEN=y
CONFIG_GPIO=y
CONFIG_GPIO_SAM0=y
CONFIG_WATCHDOG=y
CONFIG_WDT_SAM0=y
CONFIG_SPI=n
CONFIG_SPI_SAM0=n
CONFIG_PINMUX=y
CONFIG_PINMUX_SAM0=y
CONFIG_USB=y
CONFIG_USB_DC_SAM0=y
@nzmichaelh or @benpicco do you have any additional insight in this issue?
I looked into this issue further. It appears to be a general USB issue, not necessarily specific to the sam0. This issue manifests in usb_write() in usb_device.c. The function outputs a warning if it retries a write. I believe this should be a debug output instead of a warning. A check should be made after to determine if the write actually failed and if it does, a warning should be issued. Strictly speaking, this is likely unnecessary as the calling function will likely output an error.
Second, and likely more importantly, there is a race condition here. The numbers of retries is an arbitrary number and depending on the scheduler, the amount of time it takes to retry here is arbitrary. I found that making the message a debug (and turning off debug), reduced the time to the point that write failed quite often. Since the USB slot time is 1 ms, I recommend k_yield() be replaced with k_usleep(500L). This would delay the next retry for 1/2 the USB slot time.
existing usb_write():
int usb_write(u8_t ep, const u8_t *data, u32_t data_len, u32_t *bytes_ret)
{
int tries = CONFIG_USB_NUMOF_EP_WRITE_RETRIES;
int ret;
do {
ret = usb_dc_ep_write(ep, data, data_len, bytes_ret);
if (ret == -EAGAIN) {
LOG_WRN("Failed to write endpoint buffer 0x%02x", ep);
k_yield();
}
} while (ret == -EAGAIN && tries--);
return ret;
}
proposed change:
int usb_write(u8_t ep, const u8_t *data, u32_t data_len, u32_t *bytes_ret)
{
int tries = CONFIG_USB_NUMOF_EP_WRITE_RETRIES;
int ret;
do {
ret = usb_dc_ep_write(ep, data, data_len, bytes_ret);
if (ret == -EAGAIN) {
LOG_DBG("Failed to write endpoint buffer 0x%02x, %d retries remaining", ep, tries);
k_usleep(500L);
}
} while (ret == -EAGAIN && tries--);
if ( tries < 0 ) {
LOG_WRN("Failed to write endpoint buffer 0x%02x", ep);
}
return ret;
}
I looked into this issue further. It appears to be a general USB issue, not necessarily specific to the sam0. This issue manifests in usb_write() in usb_device.c. The function outputs a warning if it retries a write. I believe this should be a debug output instead of a warning.
No. It has to be a warning. If it happens for an endpoint other than control then something is fishy in the driver.
Second, and likely more importantly, there is a race condition here. The numbers of retries is an arbitrary number and depending on the scheduler, the amount of time it takes to retry here is arbitrary. I found that making the message a debug (and turning off debug), reduced the time to the point that write failed quite often. Since the USB slot time is 1 ms, I recommend k_yield() be replaced with k_usleep(500L). This would delay the next retry for 1/2 the USB slot time.
regardless, warning should not appear. The driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests.
Facing same issue:
[00:00:38.034,088]
[00:00:38.034,118]
[00:00:38.034,118]
Using the below patch, based on https://github.com/zephyrproject-rtos/zephyr/issues/20250 we are no longer seeing the "Failed to write endpoint buffer" after transferring a few hundred megabytes. This is after getting it consistently along with data corruption previously. It should be noted that our use case mostly consists of single direction transfers from device to host, so high throuput from host to device has not been tested.
diff --git a/subsys/usb/class/cdc_acm.c b/subsys/usb/class/cdc_acm.c index 3f7082374d..0a3bcf25a4 100644 --- a/subsys/usb/class/cdc_acm.c +++ b/subsys/usb/class/cdc_acm.c @@ -303,7 +303,7 @@ static void tx_work_handler(struct k_work *work) LOG_DBG("Got %d bytes from ringbuffer send to ep %x", len, ep); - usb_transfer(ep, dev_data->tx_buf, len, USB_TRANS_WRITE, + usb_transfer(ep, dev_data->tx_buf, len, USB_TRANS_WRITE | USB_TRANS_NO_ZLP, cdc_acm_write_cb, dev_data); }
@binkboyd @jrananth1032 please open a new issue for this bug in CDC ACM class. This issue is specific to usb_dc_sam0.c driver.
@jfischer-phytec-iot I think I mentioned that this issue is not constrained to the sam0. Moreover, @jrananth1032 indicated he observed the problem on nRF.
Also, you mentioned earlier
The driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests.
Are you implying that the sam0 driver does not pass the tests? What is failing?
Notwithstanding my comments above, I have found that the existing driver assumes that if the endpoint is busy, a write is in progress and it drops the frame. Why does the driver make that assumption? It seems to me that there should be a timeout waiting for the previous transfer to complete? The following patch to the usb_dc_sam0.c driver appears to dramatically reduce the frequency of the warning:
@@ -504,10 +504,13 @@ int usb_dc_ep_write(u8_t ep, const u8_t *buf, u32_t len, u32_t *ret_bytes)
return -1;
}
- if (endpoint->EPSTATUS.bit.BK1RDY) {
- /* Write in progress, drop */
- return -EAGAIN;
- }
+ s64_t reftime = k_uptime_get();
+ while (endpoint->EPSTATUS.bit.BK1RDY) {
+ // previous transfer is still not complete
+ if ((k_uptime_get() - reftime) >= 1) {
+ return -EAGAIN;
+ }
+ }
I have observed the warning but it is extremely rare now and occurs under dubious circumstances.
Perhaps the same issue exists in the nordic driver as well?
With regard to suppressing ZLP's, doesn't that violate the USB spec? As I recall, ZLP's are an essential part of the protocol?
With regard to the nordic driver, the underlying code that generates the warning is:
/* USBD driver does not allow scheduling multiple DMA transfers
* for one EP at a time. Next USB transfer on this endpoint can be
* triggered after the completion of previous one.
*/
if (ep_ctx->write_in_progress) {
k_mutex_unlock(&ctx->drv_lock);
return -EAGAIN;
}
Similar to the sam0, if a write is on-going, the driver refuses to process the write, so called "drop" in the sam0 case.
It appears to me there is an underlying assumption that ep_write() is called only if the hardware is ready to receive it? If that is the case, could there be a situation where a ZLP (which is normal at the end of the transaction) is sent "early"?
@jfischer-phytec-iot @binkboyd There may be a race condition. The transfer status flag does not track the hardware state. Rather, it tracks the driver state. In most cases, this appears to be ok because of inherent delays using the kernel work queue. However, it appears to me that a usb_transfer() can overlap with a transfer that immediately follows at the hardware level. That is to say, usb_transfer() can queue a transfer before the hardware has finished sending the current packet. In this event, the transfer status is not busy, the hardware status is busy, the transfer is queued and therefore the warning is issued when usb_dc_ep_write() is executed.
Is the preferred solution to:
@sslupsky There may be a race condition, bug, bad code, or other pain in the stack, but again The sam0 driver is buggy and it does not make any sense to discus anything in the stack self until the driver passes all testusb tests. Let me or someone else fix it first, even there is a issue in the stack it does not make any sense to test it with a buggy driver.
Hi @sslupsky have you had any success in fixing this?
Hi folks. FYI, I'm working on on the SAM0 driver and will get it to pass testusb.
re: the driver returning -EAGAIN
, the dw, stm32, nrfx, and sam0 drivers all use this as 'a transfer is currently going on, please try again'. usb_write
and the log message seem to be incorrect, as the API doesn't require the call to block if there's an ongoing transmission - instead the caller is supposed to wait for the callback.
An update: so far I've fixed the buffer overflow and attach / detach memory leak. testusb now shows:
/dev/bus/usb/003/076 test 0, 0.000008 secs
/dev/bus/usb/003/076 test 1, 1.667441 secs
/dev/bus/usb/003/076 test 2, 1.935824 secs
/dev/bus/usb/003/076 test 3, 1.691187 secs
/dev/bus/usb/003/076 test 4, 1.960423 secs
/dev/bus/usb/003/076 test 5, 45.796623 secs
/dev/bus/usb/003/076 test 6, 51.611663 secs
/dev/bus/usb/003/076 test 7, 45.779666 secs
/dev/bus/usb/003/076 test 8, 51.613959 secs
/dev/bus/usb/003/076 test 9, 1.186737 secs
/dev/bus/usb/003/076 test 10 --> 32 (Broken pipe)
/dev/bus/usb/003/076 test 11, 16.278739 secs
/dev/bus/usb/003/076 test 12, 16.038821 secs
/dev/bus/usb/003/076 test 13 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 14 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 17, 1.719775 secs
/dev/bus/usb/003/076 test 18, 1.955908 secs
/dev/bus/usb/003/076 test 19, 1.684855 secs
/dev/bus/usb/003/076 test 20, 1.946822 secs
/dev/bus/usb/003/076 test 21 --> 22 (Invalid argument)
/dev/bus/usb/003/076 test 24, 46.007422 secs
/dev/bus/usb/003/076 test 27, 44.619108 secs
/dev/bus/usb/003/076 test 28, 51.224660 secs
/dev/bus/usb/003/076 test 29, 0.800093 secs
10,14, and 21 are all control endpoint related. 13 is a halt test. Getting closer...
I am afraid it is not easy to fix the control endpoint issues. I spend time to investigate it. If I remember correctly, the controller is sometime to fast at the end of a data stage, accept new setup token and overwrites the buffer. You could try to use double buffering. Refactor all the drivers and stack to use zero buffer copy or minimize the number of copies would probably relax this issue too.
@jfischer-phytec-iot @nzmichaelh @KubaFYI I believe the wait on the BK1RDY flag I demonstrated above ensures the hardware buffer is ready before it gets overwritten.
In my case, I never encounter the failed to write endpoint buffer error since making this change.
testusb passes now. Tests 14 and 21 have a bug on the Linux side - testusb runs the test with length=1024 and vary=1024 which is invalid and hits this check in usbtest.c:ctrl_out:
if (length < 1 || length > 0xffff || vary >= length)
return -EINVAL;
Limiting the vary fixes this:
$ sudo ~/src/linux-source-5.4.0/tools/usb/testusb -v 512 -D /dev/bus/usb/003/084
/home/michaelh/src/linux-source-5.4.0/tools/usb/testusb: /dev/bus/usb/003/084 may see only control tests
/dev/bus/usb/003/084 test 0, 0.000009 secs
/dev/bus/usb/003/084 test 1, 2.086418 secs
/dev/bus/usb/003/084 test 2, 2.001619 secs
/dev/bus/usb/003/084 test 3, 1.118426 secs
/dev/bus/usb/003/084 test 4, 1.025461 secs
/dev/bus/usb/003/084 test 5, 54.400337 secs
/dev/bus/usb/003/084 test 6, 57.145302 secs
/dev/bus/usb/003/084 test 7, 28.483353 secs
/dev/bus/usb/003/084 test 8, 29.611805 secs
/dev/bus/usb/003/084 test 9, 0.983739 secs
/dev/bus/usb/003/084 test 10, 4.006688 secs
/dev/bus/usb/003/084 test 11, 16.909420 secs
/dev/bus/usb/003/084 test 12, 16.823115 secs
/dev/bus/usb/003/084 test 13, 5.888045 secs
/dev/bus/usb/003/084 test 14, 0.951407 secs
/dev/bus/usb/003/084 test 17, 2.135497 secs
/dev/bus/usb/003/084 test 18, 2.021155 secs
/dev/bus/usb/003/084 test 19, 2.070156 secs
/dev/bus/usb/003/084 test 20, 2.023623 secs
/dev/bus/usb/003/084 test 21, 1.024373 secs
/dev/bus/usb/003/084 test 24, 60.867539 secs
/dev/bus/usb/003/084 test 27, 60.816679 secs
/dev/bus/usb/003/084 test 28, 56.888685 secs
/dev/bus/usb/003/084 test 29, 0.807269 secs
The fixes are on https://github.com/nzmichaelh/zephyr/tree/usb_dev. I've sent PRs for the first two.
@sslupsky re: spinning until free, I agree that that will fix the problem, but it's not clear from the API what the driver should do if the buffer is full and I'd rather not spin if this is called from an interrupt handler.
Details:
include/usb/usb_device.h
doesn't say what will happen if the TX is full, or if it's safe to call from an interrupt handler. Different drivers do different things:
Any driver that yields is not interrupt safe, and in theory isn't compatible with samples/testusb which does a write from the status interrupt.
@nzmichaelh I think in the case of sam0 (and likely others too), the buffer is being updated before the data is actually transferred out. The samd21 datasheet teaches that the BK1RDY flag needs to be checked to ensure the buffer can be written. So, I put a 1ms timeout on the spin which corresponds to the 1ms usb frame time. I believe you are guaranteed that the buffer will be empty within the usb frame time unless there is a hardware failure or disconnect during the transmission of a frame. Nevertheless, I agree that spinning for 1ms in an ISR is a long period of time and should be avoided. So, how did you address this issue? I could not see from the link how you fixed this issue.
Perhaps @carlescufi can organize a discussion about the API issue to resolve the ambiguity so this issue can be properly resolved.
@nzmichaelh Thank you for the attention on this issue.
@nzmichaelh I think in the case of sam0 (and likely others too), the buffer is being updated before the data is actually transferred out. The samd21 datasheet teaches that the BK1RDY flag needs to be checked to ensure the buffer can be written. So, I put a 1ms timeout on the spin which corresponds to the 1ms usb frame time. I believe you are guaranteed that the buffer will be empty within the usb frame time unless there is a hardware failure or disconnect during the transmission of a frame. Nevertheless, I agree that spinning for 1ms in an ISR is a long period of time and should be avoided. So, how did you address this issue? I could not see from the link how you fixed this issue.
Perhaps @carlescufi can organize a discussion about the API issue to resolve the ambiguity so this issue can be properly resolved.
@sslupsky Yes, I'd be glad to. But first, could you create a separate issue describing the API ambiguity please?
@carlescufi ok. I can do that.
Note that the SAM0 USB fixes have landed. The SAM0 now passes usbtest.
I've just encountered this on a project I'm working on as well, my test hardware is nrf52840 based. I can confirm that replacing the k_yield()
with the proposed k_usleep(500L)
makes a huge difference in fixing this here.
@emob-nordic @jfischer-phytec-iot is this still an issue? Otherwise please close it.
Describe the bug Build the blinky application with usb console and shell. When there is shell activity, there are frequent console warnings:
I have also noticed occasional dropped characters in the output stream.
To Reproduce Steps to reproduce the behavior:
The actual board is not the Adafruit board but my own custom board. I used the Adafruit board settings for my board.
Expected behavior No warnings.
Impact The console message is an annoyance. The dropped characters are a concern.
Screenshots or console output Here is a sample output of the warning message:
I also noticed in an earlier build of my blinky app that the output would occasionally drop characters. Here is a sample from that output (I added line to print the loop cnt value):
If you look closely at the output, you will see it occasionally drops a character.
Environment (please complete the following information):