xairy / raw-gadget

USB Raw Gadget โ€” a low-level interface for the Linux USB Gadget subsystem
333 stars 35 forks source link

missing -EBUSY/-EINTR inside raw_process_ep_io and raw_process_ep0_io #31

Closed patryk4815 closed 8 months ago

patryk4815 commented 1 year ago

Hi. After implementing my proxy( https://github.com/patryk4815/usb-proxy ) I tried to debug why "sometime" my code fails with proxying mass storage. It was really "random" (only occurs when transferring really large files), so after hours of debugging my code, it looks like the error is your module :(

ftrace from kernel:

       usb_proxy-5634    [000] ..... 35373.583990: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [001] ..... 35373.584716: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [002] ..... 35373.585358: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [000] ..... 35373.585993: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [000] ..... 35373.586636: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [002] ..... 35373.587320: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [002] ..... 35373.587994: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.588683: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.589416: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.590023: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.590693: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.591659: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.591701: usb_ep_dequeue+0x4/0xa8 <-raw_process_ep_io+0x198/0x4d0 [raw_gadget]
       usb_proxy-5634    [003] ..... 35373.591740: usb_ep_queue+0x4/0xc8 <-raw_process_ep_io+0x164/0x4d0 [raw_gadget]

call time:

 3) + 23.000 us   |  usb_ep_queue();
 3) * 14819.81 us |  usb_ep_dequeue();
 3)   9.352 us    |  usb_ep_queue();

dmesg:

kernel: raw-gadget.0 gadget: wait interrupted
kernel: dwc2 fe980000.usb: dwc2_hsotg_ep_sethalt(ep 00000000fe7f0a9a ep2in, 0)
kernel: dwc2 fe980000.usb: dwc2_hsotg_ep_sethalt(ep 00000000cf063d4c ep1out, 0)
kernel: raw-gadget.0 gadget: event fetching interrupted

After some time it randomly get usb_ep_dequeue. So i looked into raw-gadget code.

https://github.com/xairy/raw-gadget/blob/master/raw_gadget/raw_gadget.c#L1044

    ret = usb_ep_queue(ep->ep, ep->req, GFP_KERNEL);
    if (ret) {
        dev_err(&dev->gadget->dev,
                "fail, usb_ep_queue returned %d\n", ret);
        spin_lock_irqsave(&dev->lock, flags);
        dev->state = STATE_DEV_FAILED;
            // MISSING "ret = -EBUSY"?
        goto out_done;
    }

    ret = wait_for_completion_interruptible(&done);
    if (ret) {
        dev_dbg(&dev->gadget->dev, "wait interrupted\n");
        usb_ep_dequeue(ep->ep, ep->req);
        wait_for_completion(&done);
        spin_lock_irqsave(&dev->lock, flags);
                 // MISSING "ret = -EINTR"?
        goto out_done;
    }

I think there is missing ret = -EINTR. Not sure. But from USB_RAW_IOCTL_EP_WRITE or USB_RAW_IOCTL_EP_READ i should get response about "error" right? Now im getting "good" length writted/readed.

patryk4815 commented 1 year ago

Hmm, probably may issue is not about missing ret ๐Ÿค” . Now I see wait_for_completion_interruptible should return error ๐Ÿค” My main issue is that: whenwait interrupted happen my ioctl EPx_read/EPx_write is frozen for very long time (5-10s) - so maybe wait_for_completion(&done) taking too much time?

Or we need check returned error from usb_ep_dequeue(ep->ep, ep->req); ๐Ÿค”

What if (just wondering ๐Ÿค” ):

  1. usb_ep_dequeue returns error
  2. wait_for_completion is happening on "completed" task
  3. deadlock?
patryk4815 commented 1 year ago

I wrote this patch and now everything working fine: image

No issues after this patch. (except deadlock :trollface: on killing process )

I think better patch would be:

    do {
        ret = wait_for_completion_interruptible(&done);
        if (ret) {
            spin_lock_irqsave(&dev->lock, flags);
            if (dev->state != STATE_DEV_RUNNING) {
                dev_dbg(dev->dev, "wait interrupted, fail, device is not running\n");
                goto out_unlock;
            }
            spin_unlock_irqrestore(&dev->lock, flags);
        }
    }
    while (ret)
patryk4815 commented 1 year ago

btw i asked GPT AI where error ๐Ÿ‘€

Zrzut ekranu 2022-12-6 o 20 06 35 Zrzut ekranu 2022-12-6 o 20 10 52
xairy commented 1 year ago

Hi,

There's indeed an issue with returning a proper code when the request was interrupted, we probably need a fix for this.

However, I don't understand why the issue you mention is happening. usb_ep_dequeue is not checked, as it can fail in case the request has already been dequeued. And then its completion routine must have already been executed.

However, let's check if usb_ep_dequeue indeed fails.

Could you try running your test with the following patch:

diff --git a/raw_gadget/raw_gadget.c b/raw_gadget/raw_gadget.c
index f203fe2..a294589 100644
--- a/raw_gadget/raw_gadget.c
+++ b/raw_gadget/raw_gadget.c
@@ -968,6 +968,8 @@ static void gadget_ep_complete(struct usb_ep *ep, struct usb_request *req)
        spin_unlock_irqrestore(&dev->lock, flags);

        complete((struct completion *)req->context);
+
+       dev_dbg(&dev->gadget->dev, "ep = %px, completed\n", r_ep);
 }

 static int raw_process_ep_io(struct raw_dev *dev, struct usb_raw_ep_io *io,
@@ -1035,10 +1037,14 @@ static int raw_process_ep_io(struct raw_dev *dev, struct usb_raw_ep_io *io,
                goto out_done;
        }

+       dev_dbg(&dev->gadget->dev, "ep = %px, queued\n", ep);
+
        ret = wait_for_completion_interruptible(&done);
        if (ret) {
                dev_dbg(&dev->gadget->dev, "wait interrupted\n");
-               usb_ep_dequeue(ep->ep, ep->req);
+               int deq_ret = usb_ep_dequeue(ep->ep, ep->req);
+               if (deq_ret)
+                       dev_dbg(&dev->gadget->dev, "ep = %px, deq_ret = %d\n", ep, deq_ret);
                wait_for_completion(&done);
                spin_lock_irqsave(&dev->lock, flags);
                goto out_done;

and share the dmesg output?

Thanks!

patryk4815 commented 1 year ago

My diff:

diff --git a/raw_gadget/raw_gadget.c b/raw_gadget/raw_gadget.c
index 5eedee5..de9cb1c 100644
--- a/raw_gadget/raw_gadget.c
+++ b/raw_gadget/raw_gadget.c
@@ -1001,6 +1001,8 @@ static void gadget_ep_complete(struct usb_ep *ep, struct usb_request *req)
        spin_unlock_irqrestore(&dev->lock, flags);

        complete((struct completion *)req->context);
+
+       dev_dbg(&dev->gadget->dev, "ep = %px, ep_id=%d completed\n", r_ep->ep, r_ep->addr);
 }

 static int raw_process_ep_io(struct raw_dev *dev, struct usb_raw_ep_io *io,
@@ -1068,11 +1070,16 @@ static int raw_process_ep_io(struct raw_dev *dev, struct usb_raw_ep_io *io,
                goto out_done;
        }

+        dev_dbg(&dev->gadget->dev, "ep = %px, ep_id=%d, queued\n", ep->ep, ep->addr);
+
        ret = wait_for_completion_interruptible(&done);
        if (ret) {
-               dev_dbg(&dev->gadget->dev, "wait interrupted\n");
-               usb_ep_dequeue(ep->ep, ep->req);
+               dev_dbg(&dev->gadget->dev, "wait interrupted ep_id=%d \n", ep->addr);
+                int deq_ret = usb_ep_dequeue(ep->ep, ep->req);
+                if (deq_ret)
+                        dev_dbg(&dev->gadget->dev, "ep = %px, ep_id=%d, deq_ret = %d\n", ep->ep, ep->addr, deq_ret);
                wait_for_completion(&done);
+               dev_dbg(&dev->gadget->dev, "wait interrupted done ep_id=%d\n", ep->addr);
                spin_lock_irqsave(&dev->lock, flags);
                goto out_done;
        }

Maybe can we remove usb_ep_dequeue? From kernel documentation:

 * Note that some hardware can't clear out write fifos (to unlink the request
 * at the head of the queue) except as part of disconnecting from usb. Such
 * restrictions prevent drivers from supporting configuration changes,
 * even to configuration zero (a "chapter 9" requirement).
patryk4815 commented 1 year ago

Sometime getting this:

Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: wait interrupted ep_id=2
Dec 12 00:36:13 rpi.cypis.ovh kernel: dwc2 fe980000.usb: dwc2_hsotg_ep_stop_xfr: timeout DIEPINT.NAKEFF
Dec 12 00:36:13 rpi.cypis.ovh kernel: dwc2 fe980000.usb: dwc2_hsotg_ep_stop_xfr: timeout DOEPCTL.EPDisable
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: wait interrupted done ep_id=2
Dec 12 00:36:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued

^ this time next "message" blocks very long time and host stop responding.

Some other time:

Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: wait interrupted ep_id=2
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, deq_ret = -22
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: wait interrupted done ep_id=2
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2 completed
Dec 12 00:19:13 rpi.cypis.ovh kernel: raw-gadget.0 gadget: ep = ffff1849c1e76880, ep_id=2, queued

^ this time im getting deq_ret = -22 from usb_ep_dequeue

patryk4815 commented 1 year ago

Btw -ERESTARTSYS is not propagating to user-space don't know why ๐Ÿฆ„ When i change it to -EBUSY it is propagating ๐Ÿคจ

From kernel doc:

/*
 * These should never be seen by user programs.  To return one of ERESTART*
 * codes, signal_pending() MUST be set.  Note that ptrace can observe these
 * at syscall exit tracing, but they will never be left for the debugged user
 * process to see.
 */
#define [ERESTARTSYS](https://elixir.bootlin.com/linux/latest/C/ident/ERESTARTSYS)  512
#define [ERESTARTNOINTR](https://elixir.bootlin.com/linux/latest/C/ident/ERESTARTNOINTR)    513

(probably that is fine)

xairy commented 1 year ago

Hmm, this ERESTARTSYS issue could lead to some weird behavior... If you don't see it being propagated to userspace, that means the kernel handles it internally. And a brief look at the code shows that the kernel is likely trying to restart the syscall by itself.

Could you build the Raw Gadget module from the dev branch and try running your test again? There, the ioctl should return proper values.

xairy commented 8 months ago

Sent the fix [upstream}https://lore.kernel.org/linux-usb/0db45b1d7cc466e3d4d1ab353f61d63c977fbbc5.1698329862.git.andreyknvl@gmail.com/T/#t). Will update the master branch once the patch is merged.

xairy commented 8 months ago

Note that the device emulation code needs to watch out for the -EINTR return code and then resend the request.