zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.71k stars 6.54k forks source link

net: l2: ieee802154: radio: inconsistent/non-standard definition of RX/TX timestamp #59245

Closed fgrandel closed 1 year ago

fgrandel commented 1 year ago

Describe the bug

The current IEEE 802.15.4 driver API and OpenThread define the RX/TX timestamps as "the first symbol of the MAC Header" (net_pkt.h, line 102).

But this point in time is not what the standard suggests as RX/TX timestamp (see IEEE 802.15.4-2020, section 8.3.4, table 8-90, Timestamp) - see details below in "Expected Behavior". Such a point in time cannot be used for synchronization purposes outside L2 without leaking implementation details of the PHY, see OpenThread workaround below as an example.

The two current drivers supporting RX timestamps implement this timestamp inconsistently:

The OpenThread platform API historically used the start of MAC as its RX timestamp (see @edmont's comment below) but has since turned to the standard definition (see include/openthread/platform/radio.h) "The value SHALL be the time when the SFD was received" (line 209f) and uses it in practice precisely as the standard would define it with a value of macSyncSymbolOffset == 0, see radio.c, line 189: "OpenThread expects the timestamp to point to the end of SFD".

The IEEE 802.15.4 driver API is based on OpenThread's previous definition, which is not used anywhere anymore. A workaround in radio.c had to be implemented that assumes a specific PHY (which is ok for Open Thread due to its limited choice of PHYs but not for other protocols built above L2 with arbitrary PHYs). The workaround in radio.c can easily be fixed (see below) to improve PHY/MAC encapsulation of the OT stack plus turn to the standard definition for all drivers and other L2 stacks.

To Reproduce

n/a (architecture issue)

Expected behavior

All drivers and call sites should use a common standard conforming concept of RX/TX timestamp.

The IEEE 802.15.4 standard (and OpenThread) define as time synchronization anchor for all PHYs (equivalently):

This is to be measured at the exact moment when this signal is present at the local antenna. Signal propagation delays between the antenna and the timestamp counter must be corrected for by the hardware and/or firmware.

This unique synchronization anchor is then referred to everywhere in the standard (e.g. synchronization with beacons, RSSI measurements, timestamping of received and transmitted frames, synchronization of CCA, ranging, etc.).

More specifically: RX and TX timestamps are defined to be measured relative to the end of the SFD, offset by a hardware specific constant parameter macSyncSymbolOffset (section 8.4.3.1, table 8-94) which L2 requests from the driver to calculate the exact synchronization point if required.

As the current radio API does not report the driver's macSyncSymbolOffset constant, the most obvious definition for any RX or TX timestamp would be the end of the SFD (i.e. drivers correct for their own offset as if their macSyncSymbolOffset was zero). This is the same definition that has been chosen by the OpenThread platform API.

The implementation of this for the RX timestamp is simple (and will be provided as PR by myself if the solution is accepted):

Alternatively macSyncSymbolOffset would have to be reported as a constant attribute by drivers implementing any of the *TIME* capabilities. As macSyncSymbolOffset does not have much use other than saving a few cycles on RX, I would go with the OpenThread solution for now to keep it simple.

TX timestamping must be analyzed separately and is not in scope for this bug report.

Impact

A common concept of timestamping on the RX and TX path is required for TSCH (and other timing sensitive features in the IEEE 802.15.4 standard, like CSL, ranging, etc.) no matter the PHY, band, channel, etc. chosen.

So fixing this is a pre-condition for TSCH implementation.

Logs and console output

n/a

Environment (please complete the following information):

n/a

Additional context

n/a

fgrandel commented 1 year ago

@rlubos to you as network maintainer @tbursztyka to you as subsystem maintainer @edmont to you as you introduced the existing definition in net_pkt plus the workaround in OpenThread

edmont commented 1 year ago

@fgrandel, thanks for the detailed description. https://github.com/zephyrproject-rtos/zephyr/pull/57013 was added more like a hotfix, since historically the MAC Header start was used as reference for CSL synchronization in Thread, but that changed later on. So I agree with your proposed changes to align to the end of SFD in every place.

CC: @jciupis

fgrandel commented 1 year ago

since historically the MAC Header start was used as reference for CSL synchronization in Thread

@edmont I didn't know that. That explains perfectly where you come from. Thank you for your fast response! I amended the bug description with your information. Will provide a PR then.

fgrandel commented 1 year ago

Just one more bit of information supporting the solution proposed here: TX timestamp also de-facto points to the first symbol of SHR.

1) TX timestamp (net_pkt.txtime) is de-facto only used at a single place (timed TX in CSL context from OT stack to Nordic's radio driver). 2) Nordic's driver uses the "start of SHR" definition.

@hubertmis Two questions to you in this context:

UPDATE: I changed the proposed method name from nrf_802154_shr_timestamp_get() to nrf_802154_phr_timestamp_get() for better consistency with nrf_802154_mhr_timestamp_get(). I also had a silly typo in my sentence above - we need the start of PHR (end of SHR) on TX for standard conformance, NOT the start of SHR, @jciupis correctly recognized this below.

fgrandel commented 1 year ago

Related discussion about timing and clocks (including but not only) in the radio protocol context: #19030

hubertmis commented 1 year ago

Yes. It's convenient to configure nRF RADIO to capture timestamps of the end of the received frame and begin transmission at the passed timestamp. To simplify the implementation the driver API exposes these HW properties.

  • When fixing this - are you interested in providing a re-usable nrf_802154_shr_timestamp_get() in the nrfxlib driver API

I think that's the correct way. The _nrf802154 module should expose functions to recalculate the timestamps it passes through API calls to the format expected by the standard (start of SHR) for both TX and RX paths. Or even considering how the standard defines the frame timestamps, maybe existing functions could be modified to recalculate to the expected values internally.

@jciupis, do you think these features could be delivered as a part of _nrf802154?

jciupis commented 1 year ago

do you think these features could be delivered as a part of _nrf802154?

Yes, that's the plan. I'd like to provide an API for the nRF driver to conveniently calculate RX timestamp aligned with @fgrandel's suggestion, pointing to the end of SHR. I'd prefer that over offset conversions in the nRF5 Zephyr driver.

Once we agree on a common definition of TX timestamp different than the current one, which in my understanding is out of scope of this discussion, I'll provide alignments for nRF5 driver as well.

fgrandel commented 1 year ago

@hubertmis @jciupis Thanks for responding so quickly. It's nice that we can fix this upstream in nrfxlib. :-)

So what approach do you prefer? I see two possible solutions:

1) You want to keep the current timing semantics for RX timestamps and tx_at() for backwards compat and provide conversion methods so that we can offset those timestamps in the Zephyr driver. In this case you could provide the required changes in nrfxlib and once we have that merged and the west SHA pointer is updated in Zephyr's mainline, I can fix the rest of it in a separate PR.

2) You want to change the timing semantics for RX timestamps and ...tx_at() to the standard definition: In this case updating the driver could be done in the background without updating the west SHA pointer. I could then provide a PR for this ticket plus updating the SHA pointer myself in the same change.

Does this work? Is there maybe another way that I'm overlooking?

UPDATE: It's true that TX was out-of-scope for this discussion initially but I think it could easily be fixed in one PR as we're talking about very few call sites that need to be changed overall. But if you think that's not the case then of course we can keep it separate. Sorry for just assuming that it can be bundled.

TX background from a standard perspective: There's no such thing as timed TX in the standard's MCPS-DATA.request (IEEE 802.15.4-2020, section 8.3.2), but that is to be expected as the standard hides all timing and low-level protocol issues (beacon sync, CSL, TSCH, etc.) below the MLME API. So we're talking a PHY service (aka PLME or "radio API") here in fact. The TX timestamp is then to be reported back in the MCPS-DATA.confirm (ibid, section 8.3.3) primitive. But then there the exact same definition as on the RX path is used ("symbol boundary", ibid. section 6.5.2) which amounts to the definition given in the initial post.

So I think that from a standard and cross-vendor support perspective "start of PHR" would be the obvious choice for the radio API's timed TX definition as well.

fgrandel commented 1 year ago

@edmont Quick question to you - we now discovered (see posts above) that while the RX side is offset internally in the OT stack to point to "start of PHR", the timestamps calculated with this as a base time for CSL timing on the TX side seem to refer to "start of SHR" if I'm not mistaken.

It seems to me then, that currently OT sends packets systematically a little bit too late (exactly the number of symbols in the SHR) as the timing in the standard refers to the symbol boundary in IEEE 802.15.4-2020, section 6.5.2, which is "start of PHR". Do you agree that this might be a timing bug in OT still (which would be automatically fixed by the changes proposed here for the TX path of course)?

UPDATE: Not true, see @edmont's response below. Just an opportunity to simplify the OT/driver interface a bit.

edmont commented 1 year ago

Hi @fgrandel, OT is sending the frames at the exact timing since the SHR symbols time is being subtracted here: https://github.com/openthread/openthread/blob/main/src/core/thread/csl_tx_scheduler.cpp#L162

However, the transmission time base time is not properly documented (https://github.com/openthread/openthread/blob/main/include/openthread/platform/radio.h#L276) with regards to the exact moment in the frame it refers to. So it's a good opportunity to do so. We would need to synchronize the changes.

fgrandel commented 1 year ago

@edmont Nice! Thanks for responding so quickly and of course you're right. So from my side I'd be available to also move that TX workaround to the nRF driver plus propose improved doc on the OT and radio API side, as @edmont proposes. I think this will not only standardize the definition to something obvious but also simplify the OT side a bit.

jciupis commented 1 year ago

@fgrandel the approach I'd like to suggest to handle the changes for nRF driver is:

  1. PR to hal_nordic with new API that converts RX/TX timestamps from the format supported by the existing nRF driver API to the format agreed on above (RMARKER). The existing APIs are left unchanged for backwards compatibility. I'll handle this PR.
  2. PR to Zephyr that integrates changes in hal_nordic through a west manifest update. I can handle this one too.

In parallel, in no particular order (although some level of synchronization is going to be necessary):

If I understood you correctly, you're available to take care of these three points. Are there any missing pieces?

fgrandel commented 1 year ago

If I understood you correctly, you're available to take care of these three points.

@jciupis Correct, I can do the PRs in this area against the OT (docs - and if required - implementation) and Zephyr repos.

Are there any missing pieces?

No, not that I know of.

carlescufi commented 1 year ago

@fgrandel assigning this to you since you proposed opening the PRs for this yourself in an earlier comment.

fgrandel commented 1 year ago

@edmont @jciupis I have tried to make the necessary changes in Zephyr and OpenThread and integrate them with the work @jciupis has done on the NRF side.

OpenThread: https://github.com/fgrandel/openthread/commits/feat/standards-based-timing Zephyr: https://github.com/fgrandel/zephyr/commits/feat/net-unified-timestamp

Unfortunately I do not currently have the hardware and precise measurement tools to measure timings nor could I compile the low-level driver changes as the timing stuff requires the proprietary version of the SL. Therefore there must be bugs in there - I probably mixed up something while shuffling timing offsets back and forth.

@edmont It would be great if you could have a close look and maybe do some measurements on your side to check what's still wrong. Is there a way we can collaborate directly over Discord/WhatsApp or similar, so that I can provide fast fixes in case you find bugs? I'd be available for pairing if that helps.

UPDATE: Got some DevKits now - so just need some help in setting these up (maybe with some kind of sniffer) to do regression tests myself.

edmont commented 1 year ago

@edmont @jciupis I have tried to make the necessary changes in Zephyr and OpenThread and integrate them with the work @jciupis has done on the NRF side.

OpenThread: https://github.com/fgrandel/openthread/commits/feat/standards-based-timing Zephyr: https://github.com/fgrandel/zephyr/commits/feat/net-unified-timestamp

~Unfortunately I do not currently have the hardware and precise measurement tools to measure timings~ nor could I compile the low-level driver changes as the timing stuff requires the proprietary version of the SL. Therefore there must be bugs in there - I probably mixed up something while shuffling timing offsets back and forth.

@edmont It would be great if you could have a close look and maybe do some measurements on your side to check what's still wrong. Is there a way we can collaborate directly over Discord/WhatsApp or similar, so that I can provide fast fixes in case you find bugs? I'd be available for pairing if that helps.

UPDATE: Got some DevKits now - so just need some help in setting these up (maybe with some kind of sniffer) to do regression tests myself.

I managed to verify the changes in the two branches linked here with nRF5 hardware and I can confirm that they are correctly setting the CSL transmission time.

fgrandel commented 1 year ago

@edmont @jciupis @hubertmis PR on OpenThread side now opened as the preparative changes in Zephyr have been merged yesterday.