zmkfirmware / zmk

ZMK Firmware Repository
https://zmk.dev/
MIT License
2.59k stars 2.66k forks source link

NKRO breaks layer-tap with "extended" keycodes #2253

Open khoek opened 4 months ago

khoek commented 4 months ago

The binding &lt my_layer F24 doesn't work properly with NKRO on, even with extended reporting enabled. I observe that: with CONFIG_ZMK_HID_KEYBOARD_NKRO_EXTENDED_REPORT=n I never see an F24 at all when tapping (as expected), but with CONFIG_ZMK_HID_KEYBOARD_NKRO_EXTENDED_REPORT=y I only infrequently get a F24 keycode after spamming, alternating between the layer-tap key and another ordinary e.g. &kp A key. Tapping the layer-tap key by itself, I never see any F24s at all. Replacing the original layer-tap binding with &lt my_layer A works fine in any case.

However, by disabling NKRO altogether with CONFIG_ZMK_HID_REPORT_TYPE_NKRO=n the problem completely goes away. I am connected over USB. Does anyone know what the cause of this could be?

caksoylar commented 4 months ago

The only difference with &lt .. F24 and &kp F24 is likely the tap duration. Former sends it for a very short duration (one HID report?) so maybe whatever is listening doesn't like that, combined with the NKRO protocol.

khoek commented 4 months ago

Thanks for your help! The thing is, from what I can see this behavior is happening on both Windows and Ubuntu, and doesn't happen for the < 12 function keys e.g. with F11 instead of F24. I'm 90% confident ZMK is deciding not to send any keycode at all when I am tapping the key, in the situation where it was compiled with NKRO on.

khoek commented 4 months ago

Yes, after setting up wireshark for USB capture I can confirm that a single tap of my &lt my_layer F24 key does cause 2 USB packets to be sent (press and release) but both of them are identical with the payload consisting of 010000... i.e. the body is all zeroes indicating no keys are being pressed. This doesn't happen over bluetooth (that is, bluetooth is working perfectly).

So there is definitely a bug in ZMK over USB + Extended Report NKRO. I'm not familiar with the codebase so it's going to take a while for me to set up a development environment and insert print statements everywhere for me to understand.

Complicating the issue is the fact that I can get the F24 to come through unreliably by alternating spamming another (working) key; from what I can see the only order which works is that I need to press the &lt my_layer F24 key, then press the other key, then release the other key, then release the &lt my_layer F24 key, all within the 200ms window for &lt to be cancelled. Perhaps there is a second pathway in the hold-tap behavior which is avoiding the bug in this case somehow? (I am just guessing, I don't know how any of this works yet.)

If anyone more knowledgeable can infer what is going wrong from this information please feel free to point me in the right direction!

khoek commented 4 months ago

Another thing is that, even more rarely, once I get the "right combination" of spammy presses to get the F24 to appear tapping the key will successfully emit more F24s for a little while. But it eventually stops working again---I get the sense that maybe there is an uninitialized memory-type problem potentially going on?

khoek commented 4 months ago

OK, I believe this is a bug in upstream Zephyr. By inserting print statements I see ZMK making a call to hid_int_ep_write() passing the 22-byte string 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00, but wireshark captures the payload 01000000000000000000000000000000000000000000 actually transmitted to the interrupt endpoint. (For the avoidance of doubt, I was testing with the F17 key here.)

Weird.

I've been testing this on an Adv360.

khoek commented 4 months ago

Okay, found the problem. There is a data race in usb_hid.c on many platforms; the call to hid_int_ep_write(), a Zephyr function, is not in general synchronous and doesn't always copy the passed buffer before beginning the USB transaction. I've only read a few of the HAL drivers, but e.g. nrfx and stm32 don't, while rpi_pico, native_posix, and kinetis do.

I conclude that copying the buffer is not part of the Zephyr function's contract (the docs are ambiguous, and on first glance would cause me to guess hid_int_ep_write() is synchronous, which is even less true). So the USB keycode report is changing underneath the HAL driver as &lt quickly deactivates the key. Noone ever noticed for smaller keycode bits because they were sent so quickly!

Pull request incoming.

khoek commented 4 months ago

The fix is in #2257. Tested working on a real Adv360.

jhorology commented 4 months ago

In my experience ,CONFIG_HID_INTERRUPT_EP_MPS=32 will solve your problem.

khoek commented 4 months ago

@jhorology Unfortunately the bug isn't not having enough bytes sent to the USB endpoint, but rather a race condition caused by an ambiguous API in upstream Zephyr (which they acknowledge, but can't change---the relevant conversation is https://github.com/zephyrproject-rtos/zephyr/pull/71306). So ZMK is going to have to work around this (at least until apparently a new USB subsystem for Zephyr which will be available at some point in the future).