xmos / lib_xud

XMOS USB code and associated examples
Other
8 stars 25 forks source link

Illegal resource exception in XUD_ResetEndpoint #356

Closed danielpieczko closed 1 year ago

danielpieczko commented 2 years ago

Apps from sw_usb_audio are intermittently seeing illegal resource exceptions from XUD_ResetEndpoint. This occurs when setting the sample rate while starting an in or out audio stream shortly after the device (the exception hasn't been seen with a delay between startup).

xrun: Program received signal ET_ILLEGAL_RESOURCE, Resource exception.
      [Switching to tile[1] core[0]]
      XUD_ResetEndpoint (one=302920, two=@0x4a708) at /Users/jenkins/daniel/usb_audio/lib_xud/lib_xud/src/user/client/XUD_EpFunctions.xc:79

      79            asm volatile ("inct %0, res[%1]": "=r"(busStateCt):"r"(c1)); // busStateCt = inct(one);

The illegal resource exception is raised because a data token is read in when a control token is expected.

The state of all the threads at the point of the exception: inct-illegal-resource.txt

USB protocol capture of this sequence (start from index 2008 where the previously running app was killed leading to a corrupted packet) reset-endpoint-exception.zip

danielpieczko commented 2 years ago

USB protocol capture from a successful test which set the sample rate and started playing an analogue signal: pass0-short.zip

danielpieczko commented 2 years ago

Re-upload of the USB capture from the failure case with the beginning truncated to the relevant point: fail0-short.zip

danielpieczko commented 2 years ago

Interesting packets from the trace in fail0-short: #7477 @ 00:03.897.740 SETUP - Set Cur Sampling Frequency #7481 @ 00:03.897.745 OUT #9115 @ 00:09.027.021 SETUP - Set Cur Sampling Frequency as in #7477, just over 5s later #9119 @ 00:09.027.100 OUT (Nak)

The equivalent sequence from pass0-short: #6499 @ 00:03.857.858 SETUP - Set Cur Sampling Frequency #6503 @ 00:03.857.863 OUT #7450 @ 00:08.845.690 IN - long gap between OUT and IN, but just under 5s #7472 @ 00:08.847.985 SETUP - Set Cur Sampling Frequency (again) #7477 @ 00:08.847.990 OUT #7481 @ 00:08.868.005 IN - this is just 20ms after the SETUP

In both cases there is a longer than expected gap between the OUT and the IN. Maybe the host resends the SETUP after five seconds, and in the failing case we haven't sent the IN yet and we don't handle that well, whereas in the passing case the IN is sent just within five seconds - perhaps the host still resends the SETUP as we took too long to respond but at least we don't fall over because we aren't in the process of handling the previous one.

xross commented 2 years ago

The trace is easier to look out if you use LiveFilter to filter for EP0 only. We have to be careful because the viewer also re-arranges some of the traffic trying to be useful.. for example see the times here going from 9 to 11 then back to 9..

fail0

Right clicking and "expanding all" does help a bit. But the PING-NAKs are still shown in the wrong place. Corrected here:

fail0a

In the fail0_short trace the SetCur() to set the sample rate to 96kHz (@7477) is missing the IN status stage from the host. (looking at the code I am surprised to see that the next SETUP is ACK'ed by the device). Things are out of sync between the host and device at this point (assuming trace is accurate - see next comment)

I'm not immediately sure how this manifests itself in a trap in reset endpoint. I feel we need a directed test for the sequence:

Host: SETUP Host: DATA0 (8 bytes of junk data) Host expects: ACK

Host: OUT Host: DATA0 (n bytes of junk data) Host expects: ACK

Host: SETUP Host: DATA0 (8 bytes of junk data) Host expects: ACK

Host: BUS RESET

These test should also look to close testing hole https://github.com/xmos/lib_xud/issues/330

xross commented 2 years ago

As discussed there are two issues:

My previous comment addresses the investigation into the latter. On the former..

I would really like to see some NAKs in this trace! I cannot seem to see where they are filtered out. In "Capture Settings" could you select "Data-less Sequences" and see if this makes a difference? Without this info it's possible the missing "IN" theory is actually a red herring. The host might be sending it but its never shown on the fail trace as it is never ACK'ed.....

If the device is found to be nacking for 5s then that's cause for concern. Possibly it also modifies the test sequence mentioned above.

xross commented 2 years ago

(looking at the code I am surprised to see that the next SETUP is ACK'ed by the device).

This is actually due to a small bug in the intended functionality here: https://github.com/xmos/lib_xud/blob/011a010d036eef8b641995fb70aa70d3f64145e7/lib_xud/src/core/included/XUD_Token_Setup_DI.S#L42

It should use an offset by 32 as per L13.

The resetting endpoint received the length for the next SETUP instead of a CT and this causes the trap.

darasan commented 1 year ago

Hi @xross, has this fix been tested on XS2 devices as well as XS3? We are seeing a similar issue on one of our XS2-based products that uses the sw_usb_audio reference design, with the latest lib_xud and this fix added.

If we play audio on output channels 1 and 2 (other channels seem not to trigger it) and then change the sample rate while audio is streaming, we get:

Program received signal ET_ILLEGAL_RESOURCE, Resource exception. [Switching to tile[0] core[4]] Pid_In () at XUD_CrcAddrCheck.S:34 in XUD_CrcAddrCheck.S

We do not have the issue using the previous version of XUD, or on another XS3-based product that we have. Could it be related to this issue? Thanks

xross commented 1 year ago

Hi @darasan

Yes, we regress the USB audio design on both XS2 (xCORE-200) and XS3 (xCORE.ai) based devices. However, this fix has not made it into a release yet so has not been fully regressed in this design. This will be done soon.

What we don't have is the XUD simulation based regression for XS2 - we only added USB support for XS3 in xsim.

Audio channels shouldn't matter - streaming is all or no channels. Perhaps some buffer flow could cause some channel-data based memory corruption but seems unlikely.

Please raise a new issue for this one. Although it could be related to the fix of this issue it sounds like an issue in its own right.

So you have the ability to provide a USB trace? If so please attach one to the new issue.

darasan commented 1 year ago

OK, thanks for the feedback. Agreed about the audio channels - shouldn't matter but it seems to. We'll test some more and then open a new issue with the details. Yes we can provide a trace (Ellisys analyser).

xross commented 1 year ago

OK, thanks for the feedback. Agreed about the audio channels - shouldn't matter but it seems to. We'll test some more and then open a new issue with the details. Yes we can provide a trace (Ellisys analyser).

Great, that's actually my preferred format! Please also provide detail on what host OS you are running.

xross commented 1 year ago

Closing this one. An issue in lib_xua was causing the device to NAK for a long period causing the host to "give up" on the device. This lib_xud issue was then encountered in the following reset from the host. A test case has been authored.