zmkfirmware / zmk

ZMK Firmware Repository
https://zmk.dev/
MIT License
2.48k stars 2.56k forks source link

fix(hid): Eliminate data race in USB pathway causing dropped keys [ALTERNATIVE] #2267

Closed khoek closed 2 days ago

khoek commented 2 months ago

Alternative to https://github.com/zmkfirmware/zmk/pull/2257, solving the same problem which happens there, perhaps more elegantly. We simply move the existing mutex to protect calls to hid_int_ep_write() as well as the passed buffer, causing zmk_usb_hid_send_report() to become synchronous on first invocation. (Subsequent invocations of zmk_usb_hid_send_report() while the first USB TX was pending also blocked previously.)

The cost of the introduced synchronous delay upon first invocation can be mitigated (and latency I imagine actually slightly improved in some applications) by wrapping zmk_usb_hid_send_report() in a work queue, but I doubt this is worth the memory + task switching penalty. After all, the whole reason the data race wasn't discovered for so long was because the USB TX transactions were completing so blazingly fast in the first place.

Tested working on real hardware (Adv360 Pro).

Closes #2253, closes #2257.

petejohanson commented 1 month ago

I'm not sure I love this version as much. In particular, since we call the API here from the system work queue, we'll be completely blocking all processing while waiting for the USB TX to complete (unless I'm misreading this, but that seems to be the net impact), where as the current code (which admittedly still has the reported bug) avoids this.

I'm tempted to go with the solution in #2257, but ask for that to put the "copy into temp buffer" behavior behind an optional Kconfig flag, so we can have the fix in place only when using a platform/driver that requires it.

Thoughts?

khoek commented 1 month ago

@petejohanson I agree with you in all respects. Though, I think it's worth pointing out that (at least I believed when I last read the code), in the current implementation the main thread still gets hung up when processing various behaviours, e.g. if a "tap" from a hold-tap is emitted then the code will immediately emit a press then a release USB message, and the main thread will still temporarily hang on sending the second message until sending the first one is complete (for the same reason as your describe above).

All of this class of problems could be alleviated by moving the USB messages to a separate work queue with a simple handler just processing (and waiting on) each queued message item one by one. If something like that was implemented at some point down the track, would you potentially have an interest in merging it?

(I might add the Kconfig guard to the other pull tomorrow, if that's okay with you.)

petejohanson commented 4 days ago

All of this class of problems could be alleviated by moving the USB messages to a separate work queue with a simple handler just processing (and waiting on) each queued message item one by one. If something like that was implemented at some point down the track, would you potentially have an interest in merging it?

For sure, I'm down for that solution to this.

(I might add the Kconfig guard to the other pull tomorrow, if that's okay with you.)

Sounds good.

khoek commented 2 days ago

@petejohanson Sounds great! I have updated the original pull https://github.com/zmkfirmware/zmk/pull/2257 with the Kconfig flag and will close this one. Also, I was able to replace the rather hacky sequence of nested #defines and MAX()es used to obtain the maximum size of a report with a much more elegant solution (a union).