zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.14k stars 6.23k forks source link

SAM V71B Initial USB Transfer Drops Data Bytes #48671

Closed nick-kraus closed 1 year ago

nick-kraus commented 1 year ago

Describe the bug After a reboot, the first instance of writing data from the device (sam_v71b_xult board, usbhs) to a host (PC) gets dropped and no data is available to be read from the host. After the first write, subsequent writes work as expected, even when resetting and re-configuring the usb port.

Problem was originally discovered when developing a custom firmware which acts as a usb composite device, saw same behavior on both a raw bulk endpoint loopback, and a configured cdc acm interface which was looped back over a uart interface (where the serial emulator app wouldn't register the first byte sent to the device after a reboot). In both these cases, with additional logging enabled, the board usb subsystem saw the incoming usb out packet from the host, did its processing (loopback), and as far as the board knows correctly wrote a usb in packet of the same data and size.

The same behavior is not seen on a nucleo_h743zi dev board with the same codebase.

To Reproduce Steps to reproduce the behavior:

  1. build and flash the webusb sample onto an Atmel sam_v71b_xult dev board
  2. reset the dev board
  3. use the below python and pyusb script to view the buggy behavior

Expected behavior Even after a full device reboot, the first usb out transaction should be properly looped back and able to be read by the host.

Impact Major annoyance for proper device usage and testing, as any time the usb is used it now needs to be 'primed' with a write before doing any useful data, as we don't have a way of knowing after connecting if the board has been freshly rebooted.

Logs and console output The webusb sample has been modified to have additional debug logging on the usb device. In the below snippet both of the usb loopback transfers appear identical to the device.

[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer_work: Transfer done, ep 0x02, status 0, size 3←[0m
[00:00:46.470,000] ←[0m<dbg> webusb: webusb_read_cb: cfg 0x204001a0 ep 2 size 3←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer: Transfer start, ep 0x81, data 0x204047e4, dlen 3←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer: Transfer start, ep 0x02, data 0x204047e4, dlen 512←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer_work: Transfer done, ep 0x81, status 0, size 3←[0m
[00:00:46.470,000] ←[0m<dbg> webusb: webusb_write_cb: ep 81 size 3←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer_work: Transfer cancelled or completed, ep 0x02←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer_work: Transfer done, ep 0x02, status 0, size 3←[0m
[00:00:46.470,000] ←[0m<dbg> webusb: webusb_read_cb: cfg 0x204001a0 ep 2 size 3←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer: Transfer start, ep 0x81, data 0x204047e4, dlen 3←[0m
[00:00:46.470,000] ←[0m<dbg> usb_transfer: usb_transfer: Transfer start, ep 0x02, data 0x204047e4, dlen 512←[0m
[00:00:46.471,000] ←[0m<dbg> usb_transfer: usb_transfer_work: Transfer done, ep 0x81, status 0, size 3←[0m
[00:00:46.471,000] ←[0m<dbg> webusb: webusb_write_cb: ep 81 size 3←[0m

Environment (please complete the following information):

Additional context

# python script used to cause buggy behavior
import usb.core
dev = usb.core.find(idVendor=0x2fe3, idProduct=0x000a)
# replace 0x02 and 0x81 with the necessary endpoint address, if they differ from here
# below will always print '3'
print(f'wrote {dev.write(0x02, [1, 2, 3])} bytes to usb')
# below will not print the loopback data after initial board boot or reset
# but it will print the loopback data correctly on the second run of the script (without a reset)
print(f'received {dev.read(0x81, 3)} back from usb')
# below will always print '3'
print(f'wrote {dev.write(0x02, [4, 5, 6])} bytes to usb')
# below will always print '[4, 5, 6]', even when above fails, so the data isn't just delayed
print(f'received {dev.read(0x81, 3)} back from usb')
nick-kraus commented 1 year ago

Hi @nandojve , I've been trying to look into and diagnose what exactly is happening here, and I think I've found what is going on. It looks like the pointer to the endpoint FIFO for any given endpoint isn't being initialized until after the first write attempt, causing the first write transfer on any endpoint to be dropped.

In the existing code, when usb_dc_ep_write() is called for the first time on any given IN endpoint, the FIFO pointer is null, as seen in the logs below ('into fifo ((nil))...'):

[00:02:43.402,000] <wrn> usb_transfer: Transfer start, ep 0x81, data 0x20402798, dlen 4
[00:02:43.402,000] <wrn> usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512
[00:02:43.402,000] <wrn> usb_dc_sam_usbhs: putting 4 bytes of data into fifo ((nil)) of ep 0x81
[00:02:43.402,000] <wrn> usb_transfer: Transfer done, ep 0x81, status 0, size 4
[00:02:43.403,000] <wrn> usb_transfer: Transfer done, ep 0x02, status 0, size 3
[00:02:43.403,000] <wrn> usb_transfer: Transfer start, ep 0x81, data 0x20402798, dlen 3
[00:02:43.403,000] <wrn> usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512
[00:02:43.403,000] <wrn> usb_dc_sam_usbhs: putting 3 bytes of data into fifo (0xa0108000) of ep 0x81
[00:02:43.403,000] <wrn> usb_transfer: Transfer done, ep 0x81, status 0, size 3
[00:02:43.403,000] <wrn> usb_transfer: Transfer done, ep 0x02, status 0, size 2
[00:02:43.403,000] <wrn> usb_transfer: Transfer start, ep 0x81, data 0x20402798, dlen 2
[00:02:43.403,000] <wrn> usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512
[00:02:43.403,000] <wrn> usb_dc_sam_usbhs: putting 2 bytes of data into fifo (0xa0108000) of ep 0x81
[00:02:43.403,000] <wrn> usb_transfer: Transfer done, ep 0x81, status 0, size 2

By simply adding a check to see if the fifo pointer is null before filling, and if so resetting the endpoint fifo, it seems to have now corrected the incorrect behavior. I unfortunately haven't looked through the code enough to understand if this is the proper or best way to fix this issue, I mostly tried to isolate the issue and hack in whatever solution I could.

I would be more than happy to do the work to resolve this issue and get a fix into a pull request, but I would appreciate some guidance if possible by someone who knows this particular code better.

I have added the code I used to test / diagnose / verify what I have found to this branch on my fork of zephyr. The most recent commit contains a hacky fix to the problem, and the commit before contains a python test script to help show the behavior and additional log statements that help to diagnose it.

nandojve commented 1 year ago

Hi @nick-kraus ,

One thing that help me in the past was ran testusb. I`ll see if I can check this issue on the weekend once it seems you maybe found the root of problem.

nick-kraus commented 1 year ago

Hi @nandojve,

I spend some time this weekend trying to find any issues with the testusb sample, however I wasn't able to get the testusb application to run more than just the control transfer tests 0, 9, and 10; which all passed (even after following the directions where you specify usbtest uses the "Gadget Zero" interface). Unfortunately I only had access to a Linux VM (Ubuntu 22.04) and not a dedicated Linux box itself, which may have played a role in the difficulties I was having (connecting / keeping the device connected to the VM ended up rather flaky).

After messing around further with zephyr testusb sample, I found I wasn't even able to get the simple loopback functionality to work at all. The device enumerated with a single BULK OUT endpoint at address 0x02 and a single BULK IN endpoint at address 0x81, and while I was able to write data to the OUT endpoint, I couldn't ever receive data back from the IN endpoint (any reads timed out, indicating that no data was ever written back out to that endpoint). I am not sure if I was just doing something egregiously wrong here or not, if you could verify (or not) this behavior that would be appreciated.

Regardless, I am still able to reliably recreate the original issue using the webusb sample (modified to add extra logging), which also just uses a usb loopback (but with a different code path). I have included a python script that shows the failure I'm describing (and how it works successfully after my hacky fix); it uses the pyusb package for transferring to / from the endpoints and the pyocd package for ensuring the device is in a freshly restarted state each iteration.

Python test script
```python import time import usb.core from pyocd.core.helpers import ConnectHelper session = ConnectHelper.session_with_chosen_probe() session.open() session.board.target.reset() time.sleep(1) dev = usb.core.find(idVendor=0x2fe3, idProduct=0x000a) dev.write(0x02, [1, 2, 3]) if dev.read(0x81, 3).tobytes() != b'\x01\x02\x03': raise ValueError("test failed, data was not returned after fresh reboot") else: print("test success") ```

When this script is run on commit f15c4c237488ca819d87e295ffcac3ad4382330f of my repo (upstream zephyr main but with some extra logging statements) is throws the "test failed" error, because the first transfer doesn't fully loopback as expected, and the following logs are printed. As you can see, the address of the internal driver FIFO of ep 0x81 is NULL at the time of the first USB transfer, which is clearly a bug.

Failing test logs
```shell *** Booting Zephyr OS build zephyr-v3.1.0-2872-gf15c4c237488 *** [00:00:00.180,000] usb_dc_sam_usbhs: putting 18 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.181,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.191,000] usb_dc_sam_usbhs: putting 18 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 32 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 64 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100000) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 38 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 34 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.196,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.717,000] usb_dc_sam_usbhs: putting 2 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.717,000] usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512 [00:00:00.717,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.729,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.729,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.730,000] usb_dc_sam_usbhs: putting 14 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.730,000] usb_dc_sam_usbhs: putting 14 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.730,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.731,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.735,000] usb_dc_sam_usbhs: putting 17 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:01.082,000] usb_transfer: Transfer done, ep 0x02, status 0, size 3 [00:00:01.082,000] usb_transfer: Transfer start, ep 0x81, data 0x20402798, dlen 3 [00:00:01.082,000] usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512 [00:00:01.082,000] usb_dc_sam_usbhs: putting 3 bytes of data into fifo ((nil)) of ep 0x81 [00:00:01.082,000] usb_transfer: Transfer done, ep 0x81, status 0, size 3 ```

Re-running the script on commit fd596012c515a9217777e47ef7a6e02da91e1833 of my repo (hacky fix for bug, need guidance on how to improve) will produce the successful output, since the fix I made will check that the FIFO is a valid address (not NULL) when the usb_dc_ep_write is called, and if it is not valid then it will reset the FIFO (which sets it to the correct address). In the current zephyr repo it seems that this fifo is not being initialized early enough, since it is always NULL on the first call to usb_dc_ep_write for any given (non-control) endpoint. Below are the logs showing the first call with that endpoint showing a valid address.

Successful test logs
```shell *** Booting Zephyr OS build zephyr-v3.1.0-2873-gfd596012c515 *** [00:00:00.182,000] usb_dc_sam_usbhs: putting 18 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.182,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.193,000] usb_dc_sam_usbhs: putting 18 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 32 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 64 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100000) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 38 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 34 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.204,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.211,000] usb_dc_sam_usbhs: putting 2 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.212,000] usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512 [00:00:00.212,000] usb_dc_sam_usbhs: putting 0 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.219,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.219,000] usb_dc_sam_usbhs: putting 4 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.220,000] usb_dc_sam_usbhs: putting 14 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.220,000] usb_dc_sam_usbhs: putting 14 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.220,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.220,000] usb_dc_sam_usbhs: putting 42 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:00.222,000] usb_dc_sam_usbhs: putting 17 bytes of data into fifo (0xa0100008) of ep 0x80 [00:00:01.069,000] usb_transfer: Transfer done, ep 0x02, status 0, size 3 [00:00:01.069,000] usb_transfer: Transfer start, ep 0x81, data 0x20402798, dlen 3 [00:00:01.069,000] usb_transfer: Transfer start, ep 0x02, data 0x20402798, dlen 512 [00:00:01.069,000] usb_dc_sam_usbhs: putting 3 bytes of data into fifo (0xa0108000) of ep 0x81 [00:00:01.069,000] usb_transfer: Transfer done, ep 0x81, status 0, size 3 ```

If there is any chance that you can recreate this issue, and then also recreate the working solution, with the above repo and scripts, that would be much appreciated. Once I have got some confirmation that this issue is reproducible for others and that my quick fix is working I can begin to look into an improved fix for this problem, and eventually open a PR.

nick-kraus commented 1 year ago

I ended up finding a good place to put the FIFO reset / initialization call, which resolved the issue, so I ended up creating pr #49888.