zephyriot / zephyr-issues

nRF52 UART behaviour sensitive to timing of baud rate initialization. #1023

Open nashif opened 7 years ago

nashif commented 7 years ago

Reported by Carles Cufi:

In drivers/serial/uart_nrf5.c the baud rate setting helper function used to contain this code:

{code}
set_baudrate = (uint32_t) (
    (uint64_t)baudrate *
    (uint64_t)UINT32_MAX /
    (uint64_t)sys_clk_freq_hz
);
{code}

which would then get compiled to an invocation of __aeabi_uldivmod to perform the 64-bit division. The result returned by __aeabi_uldivmod was correct, and the linked version of this 64-bit division function was inspected and appeared to be the right one for an armv7e-m Cortex-M4 with the FPU disabled. Nevertheless, the call has side effects: for a few milliseconds afterwards the UART outputs garbage instead of the characters being sent, and then returns to normal. Compiling with -O0 instead of the default -Os makes the problem disappear. Attached is an objdump of the Zephyr image running a modified hello_world build that triggers the issue.
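
For reference, the calculation above can be exercised on its own. The following is a minimal sketch, not the driver code: the function name baudrate_reg_value is hypothetical, and the 16 MHz clock used in the usage note is an assumed value chosen only for illustration. On a 32-bit Arm target the 64-bit division in it is the operation that gets lowered to a call to __aeabi_uldivmod.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression quoted above.
 * The name is illustrative and does not come from the driver.
 */
static uint32_t baudrate_reg_value(uint32_t baudrate, uint32_t sys_clk_freq_hz)
{
    /* A zero clock frequency would cause a division by zero. */
    assert(sys_clk_freq_hz != 0u);

    /*
     * Widen to 64 bits before multiplying so the product cannot
     * overflow, then divide and truncate back to 32 bits.
     */
    return (uint32_t)(((uint64_t)baudrate * (uint64_t)UINT32_MAX) /
                      (uint64_t)sys_clk_freq_hz);
}
```

With an assumed 16 MHz clock, baudrate_reg_value(115200, 16000000) evaluates to 30923764 (0x01D7DBF4). The result is only meaningful while baudrate does not exceed sys_clk_freq_hz, since larger quotients are truncated by the final cast.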

(Imported from Jira ZEP-1126)

nashif commented 7 years ago

by Marcus Shawcroft:

I managed to borrow an nrf52_pca10040 board and spent a couple of hours investigating this morning:

Working against master version 4a8611e5c7d66aa650237bbe41c5153cc5103ad2, with the workaround fix reverted.

To reduce the problem, I've injected a dummy version of __aeabi_uldivmod directly into the source. This one computes no useful value but is easy to tweak; the actual baud rate is hardwired.

Attached is a tar file containing 3 patches:

I am convinced that the reduced test case:

The offending instructions in the reduced test cases all look legitimate, as does conformance to the PCS etc., hence this does not look like a toolchain, compiler or libgcc issue.

The transition from a 100% failure rate to a 30% failure rate on the removal of instructions suggests to me that we have some form of race, either sw/sw or sw/hw.

nashif commented 7 years ago

by Marcus Shawcroft:

Added a replacement uart_nrf5x.S. This one replaces the __aeabi_uldivmod function with a straight-line sequence of 50 nops and a blx lr, which is sufficient to recreate the issue on about 9 out of 10 resets.

nashif commented 7 years ago

by Marcus Shawcroft:

Added a replacement uart_nrf5.c. This is the vanilla driver, but with a small nop loop injected. It fails reliably, which suggests we have some kind of timing issue.
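
The kind of delay described here can be sketched as follows. This is not the attached file: the name spin_nops, the iteration cap and the counts are hypothetical, and the code assumes a GCC-compatible compiler for the inline assembly. The number of iterations needed to provoke the failure is not stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary illustrative cap so a bad argument cannot stall for long. */
#define SPIN_NOPS_MAX 1000000u

/*
 * Hypothetical stand-in for the injected delay: executes `count`
 * nop instructions in a loop and returns how many were executed.
 * The volatile asm statement keeps the compiler from removing it.
 */
static uint32_t spin_nops(uint32_t count)
{
    uint32_t executed = 0u;

    assert(count <= SPIN_NOPS_MAX);

    while (executed < count) {
        __asm__ volatile ("nop");
        executed++;
    }
    return executed;
}
```

Calling such a helper in the baud rate setup path, in place of the division, changes only the timing of the driver and not any computed value, which is the point of the experiment.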

nashif commented 7 years ago

by Anas Nashif:

any idea what is going on here?

nashif commented 7 years ago

by Marcus Shawcroft:

The last example attached demonstrates that small changes in code timing in the driver provoke the issue. Carles Cufi and I spent an afternoon exchanging fragments of code and reached the point where we could both reliably reproduce the issue by introducing a short nop loop in the driver in place of the __aeabi_uldivmod() call. We both concluded that __aeabi_uldivmod was therefore not the cause, and that the issue was instead related to timing changes in the driver.

Carles Cufi, please correct the above if you think I've misrepresented the situation.

Carles Cufi, IIRC you were hoping to investigate the nature of the timing issue further. Have you been able to make any progress?

nashif commented 7 years ago

by Carles Cufi:

Marcus Shawcroft : no, there has been no progress so far. I've contacted the Nordic hardware team to see if they knew of any possible cause, but they were unable to provide one. As soon as I can I will try to generate a clean, bare-metal sample that reproduces the issue to send it to them, since at this point I don't think this is Zephyr-related but rather hardware-related.

nashif commented 7 years ago

by Mark Linkmeyer:

Correcting the priority field

nashif commented 7 years ago

by Carles Cufi:

This should really be downgraded to low priority. It has no real effect on the correct functioning of the OS and it is mostly cosmetic.