attie-argentum opened this issue 7 months ago
Further details... I'm fairly confident in the following, but any other eyes on my stack decoding would be appreciated.
To summarize the below: the sys work queue thread is stuck in stm32_i2c_msg_read() (which doesn't match up with the write on the bus), blocked in k_sem_take(&data->device_sync_sem, ...).
(gdb) print/x __stack_chk_guard
$375 = 0xe6658b00
(gdb) print/d k_sys_work_q.thread.stack_info.size
$376 = 2048
(gdb) print/a k_sys_work_q.thread.callee_saved.psp
$377 = 0x20014190 <sys_work_q_stack+1592>
# (2048 - 1592) / 4 = 114
# NOTE: 0xe6658b00 is the value of __stack_chk_guard (the stack canary) - watch for it below
(gdb) x/114wa k_sys_work_q.thread.callee_saved.psp
0x20014190 <sys_work_q_stack+1592>: 0x0 0xe000ed00 0x2000f308 <_kernel> 0x0
0x200141a0 <sys_work_q_stack+1608>: 0x200006b0 <barometer_timer> 0x801d399 <z_impl_k_sem_take+96> 0x800a020 <arch_swap+48> 0x61000000
0x200141b0 <sys_work_q_stack+1624>: 0x0 0xe6658b00 0x0 0x801d399 <z_impl_k_sem_take+96>
0x200141c0 <sys_work_q_stack+1640>: 0x1388 0x0 0x1388 0xe6658b00
0x200141d0 <sys_work_q_stack+1656>: 0x200106dc <i2c_stm32_dev_data_0> 0x8010d27 <stm32_i2c_transaction+346> 0x400 0x4
0x200141e0 <sys_work_q_stack+1672>: 0xa3662414 0x68 0x0 0x200142d4 <sys_work_q_stack+1916>
0x200141f0 <sys_work_q_stack+1688>: 0xe 0x7 0x10aad00 0xe6658b00
0x20014200 <sys_work_q_stack+1704>: 0x0 0x200142d0 <sys_work_q_stack+1912> 0x1 0x68
0x20014210 <sys_work_q_stack+1720>: 0x20010c92 <i2c_stm32_in_transfer> 0x200106ec <i2c_stm32_dev_data_0+16> 0x801e9e0 <__device_dts_ord_18> 0x20010c91 <i2c_stm32_trigger_mode>
0x20014220 <sys_work_q_stack+1736>: 0x20010ef5 <icm20948_state_in_isr_transfer> 0x8010f47 <i2c_stm32_transfer+166> 0x0 0x68
0x20014230 <sys_work_q_stack+1752>: 0x20014280 <sys_work_q_stack+1832> 0x801a000 <encode_header_byte> 0x2000f5b4 <power_signal_work> 0xe6658b00
0x20014240 <sys_work_q_stack+1768>: 0xbf7 0x8010ea1 <i2c_stm32_transfer> 0x20010ef4 <icm20948_state_in_transfer> 0x2001078c <icm20948_0_data>
0x20014250 <sys_work_q_stack+1784>: 0x0 0x801f748 <icm20948_0_config> 0x801eaa8 <__device_dts_ord_88> 0x0
0x20014260 <sys_work_q_stack+1800>: 0x0 0x801a7af <_icm20948_transfer+90> 0x801007f <entropy_stm32_rng_get_entropy_isr+186> 0xe6658b00
0x20014270 <sys_work_q_stack+1816>: 0x2001428c <sys_work_q_stack+1844> 0x2001079c <icm20948_0_data+16> 0x48000000 0x20010ef6 <icm20948_state_in_isr_work>
0x20014280 <sys_work_q_stack+1832>: 0x20010ef5 <icm20948_state_in_isr_transfer> 0x801a8d7 <icm20948_isr_work+126> 0x0 0x20014268 <sys_work_q_stack+1808>
0x20014290 <sys_work_q_stack+1848>: 0x2 0x4 0x200142c8 <sys_work_q_stack+1904> 0x1
0x200142a0 <sys_work_q_stack+1864>: 0x4 0x200142d0 <sys_work_q_stack+1912> 0x4 0x5
0x200142b0 <sys_work_q_stack+1880>: 0x200142cc <sys_work_q_stack+1908> 0x1 0x4 0x200142d4 <sys_work_q_stack+1916>
0x200142c0 <sys_work_q_stack+1896>: 0xe 0x7 0x800a019 <arch_swap+40> 0x6100002d
0x200142d0 <sys_work_q_stack+1912>: 0x100 0xe6658b00 0x0 0x801cdbd <z_sched_wait+28>
0x200142e0 <sys_work_q_stack+1928>: 0xffffffff 0xe6658b00 0x4 0x2001079c <icm20948_0_data+16>
0x200142f0 <sys_work_q_stack+1944>: 0x2000f360 <k_sys_work_q> 0x200107d4 <pending_cancels> 0x0 0xffffffff
0x20014300 <sys_work_q_stack+1960>: 0xffffffff 0x801d50f <work_queue_main+142> 0xffffffff 0xffffffff
0x20014310 <sys_work_q_stack+1976>: 0x0 0xaaaaaaaa 0xaaaaaaaa 0xe6658b00
0x20014320 <sys_work_q_stack+1992>: 0xaaaaaaaa 0x801d481 <work_queue_main> 0x0 0x0
0x20014330 <sys_work_q_stack+2008>: 0x0 0x0 0x0 0x0
0x20014340 <sys_work_q_stack+2024>: 0x0 0x800583f <z_thread_entry+22> 0x801d481 <work_queue_main> 0xe6658b00
0x20014350 <sys_work_q_stack+2040>: 0x0 0xaaaaaaaa
$ arm-none-eabi-nm -S ./zephyr/zephyr.elf | grep stm32_i2c_transaction
08010bcc 000001f4 T stm32_i2c_transaction
stm32_i2c_transaction+346
--> 0x08010bcc + 346 = 0x08010d26
$ arm-none-eabi-addr2line -e ./zephyr/zephyr.elf 0x08010d26
/home/user/zephyrproject/zephyr/drivers/i2c/i2c_ll_stm32_v2.c:596
.../build/zephyr/zephyr.lst:
8010d04: f7ff fda0 bl 8010848 <msg_init>
stm32_i2c_enable_transfer_interrupts(dev);
8010d08: 6868 ldr r0, [r5, #4]
8010d0a: f7ff fd4d bl 80107a8 <stm32_i2c_enable_transfer_interrupts.isra.0>
SET_BIT(I2Cx->CR1, I2C_CR1_RXIE);
8010d0e: f8d9 3000 ldr.w r3, [r9]
8010d12: f043 0304 orr.w r3, r3, #4
8010d16: f8c9 3000 str.w r3, [r9]
8010d1a: f241 3288 movw r2, #5000 @ 0x1388
8010d1e: 2300 movs r3, #0
8010d20: 4620 mov r0, r4
8010d22: f00c fb09 bl 801d338 <z_impl_k_sem_take>
if (k_sem_take(&data->device_sync_sem,
8010d26: b150 cbz r0, 8010d3e <stm32_i2c_transaction+0x172>
stm32_i2c_master_mode_end(dev);
8010d28: 4628 mov r0, r5
8010d2a: f7ff fd63 bl 80107f4 <stm32_i2c_master_mode_end>
8010d2e: f04f 32ff mov.w r2, #4294967295 @ 0xffffffff
8010d32: f04f 33ff mov.w r3, #4294967295 @ 0xffffffff
8010d36: 4620 mov r0, r4
.../zephyr/drivers/i2c/i2c_ll_stm32_v2.c: (doesn't match up with the repo, I have local changes)
static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
uint8_t *next_msg_flags, uint16_t slave)
{
[...]
msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);
stm32_i2c_enable_transfer_interrupts(dev);
LL_I2C_EnableIT_RX(i2c);
if (k_sem_take(&data->device_sync_sem, /* <-- line 596 */
K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
stm32_i2c_master_mode_end(dev);
k_sem_take(&data->device_sync_sem, K_FOREVER);
is_timeout = true;
}
... but the semaphore isn't timing out?!
In fact, the semaphore believes nobody is waiting:
(gdb) print i2c_stm32_dev_data_0.device_sync_sem
$379 = {
wait_q = {
waitq = {
{
head = 0x200106dc <i2c_stm32_dev_data_0>,
next = 0x200106dc <i2c_stm32_dev_data_0>
},
{
tail = 0x200106dc <i2c_stm32_dev_data_0>,
prev = 0x200106dc <i2c_stm32_dev_data_0>
}
}
},
count = 0,
limit = 4294967295
}
Reviewing the code, I'm now fairly confident that CR1 & 0x6 should only ever be one of 0x0, 0x2, or 0x4, and never 0x7 as seen here.
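For reference, a minimal sketch (not driver code; the helper name is made up) of the check I mean, using the CMSIS masks I2C_CR1_TXIE (0x2) and I2C_CR1_RXIE (0x4) - with CR1=0xf7, both direction interrupts are enabled at once:

#include <stdbool.h>
#include <stdint.h>
#include <stm32_ll_i2c.h>  /* pulls in the CMSIS I2C_CR1_* masks, as in the driver tree */

static bool i2c_dir_irqs_sane(I2C_TypeDef *i2c)
{
    uint32_t dir = i2c->CR1 & (I2C_CR1_TXIE | I2C_CR1_RXIE);

    /* Expect 0x0 (idle), TXIE only (write), or RXIE only (read) -
     * never both direction interrupts enabled at the same time. */
    return dir != (I2C_CR1_TXIE | I2C_CR1_RXIE);
}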
Inserting nop instructions got this running for 8h52m before the issue presented... and it still locked up in the same way - CR1=0xf7, sys_work_q is stuck on the k_sem_take(&data->device_sync_sem, ...), but device_sync_sem has no waiters listed.
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..bcc1af1bc6 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,6 +508,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
stm32_i2c_enable_transfer_interrupts(dev);
LL_I2C_EnableIT_TX(i2c);
+ __asm__ volatile ("nop");
+
if (k_sem_take(&data->device_sync_sem,
K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
stm32_i2c_master_mode_end(dev);
@@ -567,6 +569,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
stm32_i2c_enable_transfer_interrupts(dev);
LL_I2C_EnableIT_RX(i2c);
+ __asm__ volatile ("nop");
+
if (k_sem_take(&data->device_sync_sem,
K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
stm32_i2c_master_mode_end(dev);
The following patch will regularly hit the nop - breaking here and then continuing will generally produce a bus fault with PC outside code. Hitting the nop is problematic, as I believe the intent is that by this point the read / write is complete and the RXIE and TXIE interrupts are disabled... it seems TXIE gets left enabled.
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..1c7c0a285d 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -840,6 +840,13 @@ int stm32_i2c_transaction(const struct device *dev,
} else {
ret = stm32_i2c_msg_read(dev, &msg, flagsp, periph);
}
+ {
+ const struct i2c_stm32_config *cfg = dev->config;
+ I2C_TypeDef *i2c = cfg->i2c;
+ if ((i2c->CR1 & 0x6) != 0) {
+ __asm__("nop"); /* <-- break here */
+ }
+ }
if (ret < 0) {
break;
}
I have now wrapped a GPIO hi/lo around the k_sem_take(&data->device_sync_sem, ...) in each of stm32_i2c_msg_write() and stm32_i2c_msg_read():
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..2556f28809 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,8 +508,11 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
stm32_i2c_enable_transfer_interrupts(dev);
LL_I2C_EnableIT_TX(i2c);
- if (k_sem_take(&data->device_sync_sem,
- K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+ int ret = k_sem_take(&data->device_sync_sem,
+ K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+ if (ret != 0) {
stm32_i2c_master_mode_end(dev);
k_sem_take(&data->device_sync_sem, K_FOREVER);
is_timeout = true;
@@ -567,8 +570,11 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
stm32_i2c_enable_transfer_interrupts(dev);
LL_I2C_EnableIT_RX(i2c);
- if (k_sem_take(&data->device_sync_sem,
- K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+ int ret = k_sem_take(&data->device_sync_sem,
+ K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+ if (ret != 0) {
stm32_i2c_master_mode_end(dev);
k_sem_take(&data->device_sync_sem, K_FOREVER);
is_timeout = true;
Scope captures (annotations):
- (0x2d) is entirely contained by a high period on the red (bottom) trace
- (0x2d)
- (0x00, should always be 0x2d)
The prolonged runtime with the addition of a nop, paired with the short k_sem_take() and incorrect sequence, suggested a timing issue to me - perhaps N+1 k_sem_give() calls in the second transaction (first read).
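For illustration, a minimal sketch (plain Zephyr API, not driver code) of what a stray extra k_sem_give() would do: with K_SEM_MAX_LIMIT the extra count is remembered, so the next transfer's k_sem_take() returns immediately, before its own ISR has run:

#include <zephyr/kernel.h>

static struct k_sem sync_sem;

void n_plus_one_sketch(void)
{
    k_sem_init(&sync_sem, 0, K_SEM_MAX_LIMIT);

    /* Transfer N: the ISR path (erroneously) gives twice. */
    k_sem_give(&sync_sem);                       /* expected completion     */
    k_sem_give(&sync_sem);                       /* stray give: count == 2  */

    /* Transfer N consumes one count... */
    (void)k_sem_take(&sync_sem, K_MSEC(5000));

    /* ...and transfer N+1 returns 0 immediately (count was still 1),
     * so the thread moves on while the peripheral is still busy. */
    (void)k_sem_take(&sync_sem, K_MSEC(5000));
}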
I have now added a k_sem_reset() just before we set up to send:
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..b16046644c 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -503,6 +503,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
data->current.is_err = 0U;
data->current.msg = msg;
+ k_sem_reset(&data->device_sync_sem);
+
msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_WRITE);
stm32_i2c_enable_transfer_interrupts(dev);
@@ -562,6 +564,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
data->current.is_nack = 0U;
data->current.msg = msg;
+ k_sem_reset(&data->device_sync_sem);
+
msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);
stm32_i2c_enable_transfer_interrupts(dev);
This isn't a "good" / robust fix, but the system has now been running for 17+ hours (up from 10+ and 13+) without issue... I'd like to properly understand if it is indeed an N+1, and if so where / why (it's clearly not happening on every transfer).
@adamjansen - please could you clarify if you were also affected by this issue, and if this has (or at least appears to have) resolved it for you too?
I've been trying to understand why this doesn't present as a problem more regularly - how come it doesn't always lock up, and how come the semaphore appears to be in sync most of the time.
It turns out, the issue is much more common than first thought, and presents differently.
We regularly find that the semaphore value is non-zero in stm32_i2c_transaction(), immediately before calling stm32_i2c_msg_write() or stm32_i2c_msg_read(). It also seems that in the majority of cases, transfer N is broken (I filled the final read buffer with 0x5a, and we see that value in the write, so the hypothesis above is correct), and N+1 is simply not performed (that buffer is still full of 0x5a after the call to i2c_transfer_dt() completes) - the lockup occurs when N is broken, and then N+1 is also performed.
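For reference, a sketch of the 0x5a pre-fill check (the buffer length and the i2c_dt_spec are placeholders, not the real values):

#include <stdbool.h>
#include <string.h>
#include <zephyr/drivers/i2c.h>

/* Pre-fill the destination with 0x5a; if it is still untouched after
 * i2c_transfer_dt() "succeeds", the read was silently skipped. */
static bool read_touched_buffer(const struct i2c_dt_spec *spec)
{
    uint8_t buf[14];
    struct i2c_msg msg = {
        .buf = buf,
        .len = sizeof(buf),
        .flags = I2C_MSG_READ | I2C_MSG_STOP,
    };

    memset(buf, 0x5a, sizeof(buf));
    (void)i2c_transfer_dt(spec, &msg, 1);

    for (size_t i = 0; i < sizeof(buf); i++) {
        if (buf[i] != 0x5a) {
            return true;   /* data actually arrived */
        }
    }
    return false;          /* transfer N+1 never happened */
}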
In the first 5 minutes of runtime, I counted 37x instances of this (N broken, N+1 not performed, N+1 buffer untouched).
Here I am checking the semaphore with k_sem_count_get() and inserting a busy wait if it's non-zero for my trigger point (I will try to arrange for another GPIO to reduce the impact on timing). I've also relocated the call to k_sem_reset().
diff --cc drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb,a685ebebeb..e764cdafb4
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@@ -827,6 -827,6 +833,17 @@@ int stm32_i2c_transaction(const struct
int ret = 0;
do { /* do ... while to allow zero-length transactions */
++ {
++ struct i2c_stm32_data *data = dev->data;
++ unsigned int c = k_sem_count_get(&data->device_sync_sem);
++
++ if (c != 0) {
++ k_busy_wait(275);
++ }
++
++ k_sem_reset(&data->device_sync_sem);
++ }
++
if (msg.len > i2c_stm32_maxchunk) {
msg.len = i2c_stm32_maxchunk;
msg.flags &= ~I2C_MSG_STOP;
All changes rolled back, with only the pre-fill of 0x5a, resulted in a lockup after ~33 minutes:
I've just confirmed that k_sem_give() is indeed called twice in the fatal situation. Here, red / bottom is high, tightly wrapped around each k_sem_give().
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..aa8f8a2393 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -121,7 +121,9 @@ static void stm32_i2c_master_mode_end(const struct device *dev)
LL_I2C_Disable(i2c);
}
#endif
+GPIOA->BSRR = 1 << 8;
k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
}
#if defined(CONFIG_I2C_TARGET)
@@ -411,7 +413,9 @@ static void stm32_i2c_event(const struct device *dev)
LL_I2C_GenerateStopCondition(i2c);
} else {
stm32_i2c_disable_transfer_interrupts(dev);
+GPIOA->BSRR = 1 << 8;
k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
}
}
Normal situation - one k_sem_give() per i2c_msg transfer:
Fatal situation - two k_sem_give() calls after the second transfer, causing the third to end early (and get corrupted), with knock-on effects. The system is now running the I2C interrupt (stm32_i2c_event()) forever.
I am also affected by this issue. I'm using an STM32G0B1RE and Zephyr v3.6.0. Rather than a write/read sequence, my use case is a single read with three parts:
I initially used a transaction comprising three (3) struct i2c_msg, with one each for header, payload, and CRC. This works for a few dozen transactions and then locks up the I2C bus (SCL pulled low), and the MCU gets stuck processing interrupts.
3x struct i2c_msg segments (header, payload, CRC)
Notably, a total of 13 bytes are transferred instead of the expected 12 -- there's an extra 0xEE received that should not be. The time between segments is 50-60 μs.
The failed transaction has a long gap (~380 μs) after the header segment. The final byte of the segment is never acknowledged, and SCL remains low. Poking around with a debugger shows that the I2C driver is stuck waiting on a semaphore. Clearing the interrupt flags in the debugger allows execution to continue until the next occurrence. The issue does not occur when k_sem_reset() is called before each transaction segment.
2x struct i2c_msg segments (header, payload+CRC)
This is essentially identical to the 3-part transaction, including the additional byte after the CRC. The failed transaction is also similar.
1x struct i2c_msg segment (header+payload+CRC)
The three parts are all in the same direction (read), so they can be combined into a single struct i2c_msg containing header, payload, and CRC. Note that the NAK correctly gets sent after the 12th byte, and only 12 bytes are transferred. I have not observed the issue when the read is structured as a single segment, even after hours of testing.
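For clarity, a sketch of the two shapes of the transaction (the segment lengths, totalling the 12 bytes, and the i2c_dt_spec are placeholders):

#include <zephyr/drivers/i2c.h>

/* Multi-segment, same direction: the shape that eventually locks up. */
int read_split(const struct i2c_dt_spec *spec,
               uint8_t *hdr, uint8_t *payload, uint8_t *crc)
{
    struct i2c_msg msgs[3] = {
        { .buf = hdr,     .len = 2, .flags = I2C_MSG_READ },
        { .buf = payload, .len = 8, .flags = I2C_MSG_READ },
        { .buf = crc,     .len = 2, .flags = I2C_MSG_READ | I2C_MSG_STOP },
    };

    return i2c_transfer_dt(spec, msgs, 3);
}

/* Single segment: no lockups observed with this shape. */
int read_combined(const struct i2c_dt_spec *spec, uint8_t *buf)
{
    struct i2c_msg msg = {
        .buf = buf,
        .len = 12,
        .flags = I2C_MSG_READ | I2C_MSG_STOP,
    };

    return i2c_transfer_dt(spec, &msg, 1);
}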
@attie-argentum's use case shows the issue can occur with transactions that involve a direction change (and thus must use multiple struct i2c_msgs). Mine shows that the issue can also occur with multi-segment transactions in the same direction.
Because things seem to work when structured as a single struct i2c_msg, it seems like the issue may be related to using reload mode. The driver does not seem to use reload mode for the final transaction (with the I2C_MSG_STOP flag).
stm32_i2c_event() handles the Transfer Complete and Transfer Complete Reload flags together. I'm not familiar enough with the STM32 I2C peripheral to know if there's something about how these flags are handled that could be contributing to the issue.
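For what it's worth, the two flags can at least be read separately via the LL API; a sketch (not driver code, helper name made up) of telling them apart:

#include <stm32_ll_i2c.h>

static void on_transfer_complete(I2C_TypeDef *i2c)
{
    if (LL_I2C_IsActiveFlag_TCR(i2c)) {
        /* TCR: NBYTES exhausted but RELOAD is set - more data follows in
         * the same direction, so reprogram NBYTES for the next chunk
         * rather than ending the transfer. */
    } else if (LL_I2C_IsActiveFlag_TC(i2c)) {
        /* TC: genuine end of transfer (SOFTEND) - issue a STOP, or a
         * repeated START for the next message. */
    }
}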
@adamjansen - great info, thanks for sharing and confirming you're affected too.
@attie-argentum thanks for looking into this. We are labeling it low in the Zephyr triage as it seems to affect a single driver. PRs are still welcome though.
@aescolar - understood, thanks for clarifying.
I've got some ideas on how to approach this, but I'd like to discuss with others (and need to get time to spend on it)... the workaround above seems to do the trick for the moment - I've had it running for over 50 hours without issue recently.
I've just started another project, and the system locks up immediately, on the same bug.
I've been through all commits that touch i2c_ll_stm32_v2.c in the last ~3 years without luck (listed below, some omitted due to merge conflicts / relevance)... unless the bug was introduced elsewhere, this is likely over 3 years old, but I'm having a hard time believing that.
(List of commits omitted - reverting these doesn't resolve the issue.)
Here's my working fix, that has been used successfully on both projects now:
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..ed4af6b0dd 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -818,6 +818,7 @@ int stm32_i2c_transaction(const struct device *dev,
* which will make the combination of all chunks to look like one big
* transaction on the wire.
*/
+ struct i2c_stm32_data *data = dev->data;
const uint32_t i2c_stm32_maxchunk = 255U;
const uint8_t saved_flags = msg.flags;
uint8_t combine_flags =
@@ -835,6 +836,9 @@ int stm32_i2c_transaction(const struct device *dev,
msg.flags = saved_flags;
flagsp = next_msg_flags;
}
+
+ k_sem_reset(&data->device_sync_sem);
+
if ((msg.flags & I2C_MSG_RW_MASK) == I2C_MSG_WRITE) {
ret = stm32_i2c_msg_write(dev, &msg, flagsp, periph);
} else {
@teburd - as maintainer, please could you advise if you consider the above patch acceptable, or if deeper / proper investigation is required? Please bring others in if that's appropriate.
@aescolar I understand that there are probably hundreds of things to be done in Zephyr with high priority, but in my opinion this issue should be, if not critical, at least high priority too.
Yes, it is a single driver type on a limited set of SoCs, BUT it is a popular SoC family, and I2C is very commonly used and required on a board.
I am afraid that if issues like this remain unsolved for too long, it will scare people away from using Zephyr, which would be sad because it's a nice OS with a great community.
Good investigation by the way, @attie-argentum! I might see if I can reproduce your findings and possibly find the root cause. I have worked a lot with STM32 I2C in the past and know it can be a pain to get it working without glitches.
This is some really odd behavior, and should be fixed in a better way than resetting a semaphore before doing a transfer imo. I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual which is... confusing. I'm seeing that in #70260 in particular around the automatic reload when larger than 255 byte transfers are done. Before adding that things worked just fine and I'm tempted to ignore the hardware auto reload entirely as it seems error prone.
@erwango @FRASTM need to comment on this as this is an st specific driver, while I'm the maintainer and can accept a patch for this I'd really like to understand the st perspective on this first.
should be fixed in a better way than resetting a semaphore before doing a transfer
@teburd - thanks for taking a look, I completely agree.
Sadly this patch appears to be very effective, and I'm very short on time to investigate this further right now.
I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual
Huh, frustrating... thanks for linking.
Thanks @attie-argentum for the work on this.
This V2 state machine is obviously a complex piece and we'd need some time to dig into this. However, having a quick look at the driver, I'm surprised by the way the semaphore is initialized:
k_sem_init(&data->device_sync_sem, 0, K_SEM_MAX_LIMIT);
With :
/**
* @brief Maximum limit value allowed for a semaphore.
*
* This is intended for use when a semaphore does not have
* an explicit maximum limit, and instead is just used for
* counting purposes.
*
*/
#define K_SEM_MAX_LIMIT UINT_MAX
So, IIUC, when the semaphore is taken, it never blocks but just decrements the counter value. I don't think this is the expected behavior. Did you try changing the limit value to 1?
Sorry if I'm completely off here, but this is something that I'd like to be sure we're doing correctly before going further.
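For reference, a minimal sketch of the behavioural difference I mean - with a limit of 1 the semaphore acts as a binary flag and extra gives saturate rather than accumulate:

#include <zephyr/kernel.h>

static struct k_sem counting_sem;
static struct k_sem binary_sem;

void sem_limit_sketch(void)
{
    k_sem_init(&counting_sem, 0, K_SEM_MAX_LIMIT);
    k_sem_init(&binary_sem, 0, 1);

    k_sem_give(&counting_sem);
    k_sem_give(&counting_sem);  /* count == 2: the stray give is remembered */

    k_sem_give(&binary_sem);
    k_sem_give(&binary_sem);    /* count stays at 1: the extra give is dropped */
}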
@erwango - thanks for taking a look, much appreciated.
Did you have a try to change the limit value to 1 ?
I must admit, I missed that - good spot. Unfortunately, even with limit=1, the same issue occurs, and my patch is required for normal operation.
Looking at the screenshot from above (again below for clarity) would suggest that this should have helped... I think the second (much quicker) pulse on red is a k_sem_give() with an already non-zero value, but that's largely conjecture.
I must admit, I missed that - good spot. Unfortunately, even with limit=1, the same issue occurs, and my patch is required for normal operation.
OK, that was worth trying... and I think this should be fixed anyway. Btw, I'm seeing a lot of I2C drivers using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that?
I can see a few potential problems with the i2c_stm32_ll_v2 driver:
- If I read the code correctly, the I2C_MSG_RESTART functionality is mixed up with the peripheral's Reload feature. They are two different things: Restart is used for switching transfer direction (R to W or vice versa), while Reload is an STM32 I2C V2-specific feature used when transferring data chunks larger than 255 bytes. If I am right in my assumption that they are being mixed up, this is a critical flaw.
- At least two silicon issues mentioned in the errata sheets are ignored in the code, most importantly the one about TXIS, where the suggestion is to write one byte of data before enabling the IRQ.
- No data synchronization barriers (dsb instruction) are used in a few places; I suppose this could be more of a problem on higher-end MCUs with caches and more complex bus architectures (F7, H7 MCUs).
- A potential problem is an IRQ occurring during execution of the IRQ handler: this sets the NVIC pending status, so the IRQ is called again directly after returning. One solution is to mask the status flags with their corresponding enable bits (see the sketch below).
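For that last point, a sketch (not driver code; names made up) of what masking the status flags with their enable bits could look like, so a spurious NVIC re-entry with stale flags falls straight through:

#include <stm32_ll_i2c.h>  /* pulls in the CMSIS I2C_ISR_* / I2C_CR1_* masks */

static void i2c_isr_masked(I2C_TypeDef *i2c)
{
    uint32_t isr = i2c->ISR;
    uint32_t cr1 = i2c->CR1;

    if ((isr & I2C_ISR_RXNE) && (cr1 & I2C_CR1_RXIE)) {
        /* handle received byte */
    }
    if ((isr & I2C_ISR_TXIS) && (cr1 & I2C_CR1_TXIE)) {
        /* handle transmit-ready */
    }
    /* ...same pattern for STOPF / NACKF / TC / TCR... */
}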
My board (STM32WB55) also experienced similar issues after running for 40 hours. The program keeps triggering I2C interrupts, so the application can no longer execute. The monitored I2C interrupt status register shows I2Cx->ISR=0x8003.
Zephyr enables the interrupt-driven I2C driver by default. For now, I have disabled it with CONFIG_I2C_STM32_INTERRUPT=n to keep the board running for a period of time.
I must admit, I missed that - good spot. Unfortunately, even with limit=1, the same issue occurs, and my patch is required for normal operation.
OK, that was worth trying... and I think this should be fixed anyway. Btw, I'm seeing a lot of I2C drivers using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that?
I mean, it's likely to cause issues. As we discussed at EOSS: if there's state being passed along between the sem give and the thread waiting on it, and another interrupt occurs that re-enters the I2C ISR and updates those flags before the thread is done looking at them, a race has occurred.
So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and carrying on is a bit flawed in that respect, if it's possible for another interrupt to come in and update those same flags.
The semaphore not having a count limit of 1 to me is saying this semaphore could be signaled any number of times. That adds to the suspicion that this is the wrong pattern to be following.
After I disabled the I2C interrupt with CONFIG_I2C_STM32_INTERRUPT=n, the device has been working for more than 10 days and is still okay.
So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.
Is it ? Aren't irq masked while running the isr ?
IRQs are masked in the I2C peripheral at certain points, BUT if a second IRQ triggers during execution of the ISR, before the I2C peripheral IRQs are masked, the NVIC pending status is set and the ISR is called again directly after returning, regardless of the I2C IRQ enable bits. Options to fix this could be:
I believe I'm having a similar issue. I think there are a couple of potential race conditions in stm32_i2c_event().
1 - If the interrupt is delayed, then it's possible for RXNE to be detected on line 467 and cleared by reading the data on line 468. However, the next byte could arrive between here and detecting TC at line 495.
2 - If the interrupt is triggered with data->current.len non-zero, but the expected TXIS or RXNE not set, line 471/2 decrements data->current.len anyway.
I'm not entirely clear on which of these (or other potential races) is the cause, but the upshot is that a subsequent interrupt occurs with data->current.len zero and RXNE set. This doesn't actually get cleared, and the ISR spins.
My current fix is to read TC before TXIS/RXNE, which avoids the first race, and to only decrement data->current.len if one of TXIS/RXNE is set (see the sketch below).
Previously my setup would spin after 30 s to 5 min; after the change it has run for 3 hours without failure.
I'm not able to provide a patch, but hopefully this will help someone else out.
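Roughly, the idea is something like this (a sketch using the LL flag helpers, with names made up and details elided - not a real patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stm32_ll_i2c.h>

struct cur_xfer {
    uint8_t *buf;
    size_t len;
};

static void event_sketch(I2C_TypeDef *i2c, struct cur_xfer *cur)
{
    /* Sample TC/TCR *before* draining data, so a byte that lands between
     * the RXNE read and the completion check can't confuse things. */
    bool complete = LL_I2C_IsActiveFlag_TC(i2c) ||
                    LL_I2C_IsActiveFlag_TCR(i2c);

    /* Only touch the count when the matching flag really fired. */
    if (cur->len > 0 && LL_I2C_IsActiveFlag_RXNE(i2c)) {
        *cur->buf++ = LL_I2C_ReceiveData8(i2c);
        cur->len--;
    } else if (cur->len > 0 && LL_I2C_IsActiveFlag_TXIS(i2c)) {
        LL_I2C_TransmitData8(i2c, *cur->buf++);
        cur->len--;
    }

    if (complete) {
        /* end-of-transfer / reload handling goes here */
    }
}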
Fairly regularly (around ~1.5 hours, on a 100Hz interrupt, with other transactions alongside), I find that the system has locked up - some interrupts are still being serviced (e.g: I2C, CAN), but the application flow has entirely halted.
I'm running f51c8ee739, with an out-of-tree ICM-20948 driver using I2C address 0x68.
On investigation, I have found the following:
- The call to i2c_transfer_dt() never returns
- The struct i2c_msg and stacks are confirmed as intact (not corrupt / overwritten / overflowed) - STACK_SENTINEL and STACK_CANARIES are enabled
- CR1=0xf7, ISR=0x8003
- stm32_i2c_event_isr() fires continually, but stm32_i2c_event() does nothing
- Execution can be recovered from the debugger with: set ((I2C_TypeDef*)0x40005400)->CR1 = 0, cont, set ((I2C_TypeDef*)0x40005400)->CR1 = 0xf7
"Good" Transaction
"Bad" Transaction
Scrolled for decode visibility on what should be the final read, of 14x bytes (only 13x present). Green is my trigger point (SCL staying low for too long). Red is set high immediately before the call to
i2c_transfer_dt()
, and low immediately after - execution never returns.In this specific instance, noticed that the final
i2c_msg.buf
(should be a destination buffer that gets populated) was pointing at memory containing:0x00
(possibly seen in transaction 3),0x9f, 0x8d, 0xb8, 0x00, 0x00, 0x00, 0x00, 0xc1, 0xcc, 0x01, 0x08, 0xff, 0xff
(seen in transaction 4).Expected behavior
I2C operates correctly
Impact
Frequent lockups, and heavily reduced reliability. Near showstopper.
Additional Context
I will be continuing to work on this. Until earlier today, I was suspicious of my code, but in light of some recent discoveries I plan to take a close look at the I2C driver.