zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Ringbuffer used in CDC_ACM seems to corrupt data if completely filled during transfer #42031

Closed kyrreaa closed 2 years ago

kyrreaa commented 2 years ago

Describe the bug Trying to maximize throughput in CDC_ACM by monitoring the free space in the ring buffer and pushing more data in with uart_fifo_fill() whenever there is enough room for another block read from the source appears to corrupt data. With an ACM ring buffer of 24576 bytes and a write block size of 8192 bytes (one third of the buffer), double or even quad buffering corrupts the data. Waiting for 8192+1 bytes of free space instead works fine.

To Reproduce Monitor the ring buffer and call uart_fifo_fill() as soon as a block will just fit: the data is corrupted. Repeat while waiting for block size + 1 bytes of space and the data is fine.
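Roughly, the pattern looks like this (a sketch, not my exact code; the free-space accounting and the data source are simplified):

```c
/* Sketch of the fill pattern, not the exact application code. */
#include <zephyr/kernel.h>
#include <zephyr/drivers/uart.h>

#define BLOCK_SIZE 8192
/* Workaround: use (BLOCK_SIZE + 1) here and the corruption goes away. */
#define FILL_THRESHOLD BLOCK_SIZE

static const struct device *cdc_dev;          /* CDC ACM UART device */
static uint8_t block[BLOCK_SIZE];             /* next block read from the source */
static atomic_t tx_free = ATOMIC_INIT(24576); /* application's estimate of free space */

static void tx_work_handler(struct k_work *work)
{
	if (atomic_get(&tx_free) < FILL_THRESHOLD) {
		return; /* not enough room for a whole block yet */
	}

	/* uart_fifo_fill() reports how many bytes were actually accepted;
	 * only that much is subtracted from the free-space estimate.
	 */
	int written = uart_fifo_fill(cdc_dev, block, BLOCK_SIZE);

	if (written > 0) {
		atomic_sub(&tx_free, written);
	}
}
```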

Expected behavior The expectation was that the ring buffer could be filled completely by the work handler even while a transfer was ongoing.

Impact A workaround has been found, so this is mostly an annoyance.

Environment (please complete the following information):

Additional context The problem was encountered while optimizing throughput on the USB virtual serial port.

jfischer-no commented 2 years ago

@kyrreaa Can you please provide minimal changes to reproduce it?

kyrreaa commented 2 years ago

I will try again, but currently the issue is that the cdc_acm example does not build for the nrf53 target in their SDK (the image preparation fails after all build stages complete successfully).

carlescufi commented 2 years ago

@kyrreaa have you modified the CDC ACM implementation in Zephyr? Also, there is currently no protection from multiple concurrent writer access, so you should only call uart_fifo_fill() from a single thread, or serialize its access.
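For example, something along these lines (just a sketch, names are illustrative) is enough to serialize the writers:

```c
#include <zephyr/kernel.h>
#include <zephyr/drivers/uart.h>

/* Single lock shared by every thread that wants to transmit. */
static K_MUTEX_DEFINE(cdc_tx_lock);

static int cdc_tx(const struct device *dev, const uint8_t *buf, int len)
{
	int written;

	k_mutex_lock(&cdc_tx_lock, K_FOREVER);
	written = uart_fifo_fill(dev, buf, len);
	k_mutex_unlock(&cdc_tx_lock);

	return written;
}
```

Submitting all fills from a single work queue item achieves the same thing.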

kyrreaa commented 2 years ago

I will try again to get a sample working, but until I do we may consider this a buffering issue "before" the data reaches the ring buffer.

I must say though that I found the throughput really poor, since the driver keeps stopping the sequence to grab another bit of the fifo, issuing a callback only when it is done with the previous "reserved" bit. It was surprising to find that it does not check the fifo again after each 64 byte block to see whether it could update its target and continue sending from the ring buffer. Instead it tries to avoid sending a ZLP and sends 63 bytes and a 1 byte packet which basically tells the host that there is no more data. The host thus waits for the remaining time in the 1 ms interval before it requests more data, at which point the driver grabs a new reserved slice of the ring buffer.

Checking whether a larger portion has become available as it gets close to the end of the assumed work would be more efficient. Being able to trigger a callback (interrupt) when a 64 byte block has been sent and there is room for more than a certain amount in the buffer would also be good. For example, writing up to 4k at a time with a ring buffer of 12k allows data to be put into the buffer 3 times immediately, and if a callback could then happen whenever 4k is free again, the buffer could be kept filled for the entire transfer.
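With the generic interrupt-driven UART API, the application side of what I mean would look roughly like this (just a sketch; callback registration is omitted and the ring buffer and chunk size are placeholders), refilling on every TX-ready callback instead of waiting out the frame:

```c
/* Sketch: keep the driver's buffer topped up from an application ring
 * buffer on every TX-ready callback. Names and sizes are placeholders.
 */
#include <zephyr/kernel.h>
#include <zephyr/drivers/uart.h>
#include <zephyr/sys/ring_buffer.h>

RING_BUF_DECLARE(app_tx_rb, 12288);   /* 12k staging buffer */

static void uart_cb(const struct device *dev, void *user_data)
{
	if (!uart_irq_update(dev)) {
		return;
	}

	while (uart_irq_tx_ready(dev)) {
		uint8_t *data;
		uint32_t len = ring_buf_get_claim(&app_tx_rb, &data, 4096);

		if (len == 0) {
			uart_irq_tx_disable(dev);
			break;
		}

		int sent = uart_fifo_fill(dev, data, len);

		ring_buf_get_finish(&app_tx_rb, sent > 0 ? sent : 0);
		if (sent <= 0) {
			break;
		}
	}
}
```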

I have been using USBlyzer to look at the timing and blocks of data transferred and it does not appear to be transferring continuously.

jfischer-no commented 2 years ago

Instead it tries to avoid sending a ZLP and sends 63 bytes and a 1 byte packet which basically tells the host that there is no more data.

This is on purpose, to deal with different applications on the host. Depending on the host controller/application it can limit throughput to 64 bytes per frame (1 ms, FS controller). Improvable, but not a bug.
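For a rough sense of scale (approximate figures, not measurements from this issue): a full-speed bulk endpoint can carry at most 19 × 64 B = 1216 B per 1 ms frame, about 1.2 MB/s, whereas one short packet per frame works out to roughly 64 B/ms ≈ 64 kB/s, so the packetization strategy dominates the achievable throughput.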

carlescufi commented 2 years ago

@kyrreaa at this point it is unclear to us whether there indeed is data corruption with the implementation in the tree, or only with your changes. Feel free to post a PR to improve the implementation with some of the suggestions in your comments, or to turn this into an enhancement if the corruption doesn't appear with the vanilla code. Thanks!

jfischer-no commented 2 years ago

I am closing this; please reopen it with a minimal example/changes to reproduce the bug.