zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.82k stars 6.6k forks source link

STM32 I2C v2 lockup with invalid data and read/write #70077

Open attie-argentum opened 7 months ago

attie-argentum commented 7 months ago

Fairly regularly (around ~1.5 hours, on a 100Hz interrupt, with other transactions alongside), I find that the system has locked up - some interrupts are still being serviced (e.g: I2C, CAN), but the application flow has entirely halted.

I'm running f51c8ee739, with an out-of-tree ICM-20948 driver using I2C address 0x68.

On investigation, I have found the following:

"Good" Transaction

image

"Bad" Transaction

Scrolled for decode visibility on what should be the final read, of 14x bytes (only 13x present). Green is my trigger point (SCL staying low for too long). Red is set high immediately before the call to i2c_transfer_dt(), and low immediately after - execution never returns.

In this specific instance, noticed that the final i2c_msg.buf (should be a destination buffer that gets populated) was pointing at memory containing: 0x00 (possibly seen in transaction 3), 0x9f, 0x8d, 0xb8, 0x00, 0x00, 0x00, 0x00, 0xc1, 0xcc, 0x01, 0x08, 0xff, 0xff (seen in transaction 4).

image

(gdb) print/x *((I2C_TypeDef*)0x40005400)
$174 = {
  CR1 = 0xf7,
  CR2 = 0xe00d0,
  OAR1 = 0x0,
  OAR2 = 0x0,
  TIMINGR = 0x20b90d1e,
  TIMEOUTR = 0x0,
  ISR = 0x8003,
  ICR = 0x0,
  PECR = 0x0,
  RXDR = 0x0,
  TXDR = 0xff
}

Expected behavior

I2C operates correctly

Impact

Frequent lockups, and heavily reduced reliability. Near showstopper.

Additional Context

I will be continuing to work on this. Until earlier today, I was suspicious of my code, but in light of some recent discoveries I plan to take a close look at the I2C driver.

attie-argentum commented 7 months ago

Further details... I'm fairly confident in the following, but any other eyes on my stack decoding would be appreciated.

To summarize the below:


(gdb) print/x __stack_chk_guard
$375 = 0xe6658b00
(gdb) print/d k_sys_work_q.thread.stack_info.size
$376 = 2048
(gdb) print/a k_sys_work_q.thread.callee_saved.psp
$377 = 0x20014190 <sys_work_q_stack+1592>

# (2048 - 1592) / 4 = 114
# NOTE: highlight 0xe6658b00 if possible

(gdb) x/114wa k_sys_work_q.thread.callee_saved.psp
0x20014190 <sys_work_q_stack+1592>:     0x0     0xe000ed00      0x2000f308 <_kernel>    0x0
0x200141a0 <sys_work_q_stack+1608>:     0x200006b0 <barometer_timer>    0x801d399 <z_impl_k_sem_take+96>        0x800a020 <arch_swap+48>        0x61000000
0x200141b0 <sys_work_q_stack+1624>:     0x0     0xe6658b00      0x0     0x801d399 <z_impl_k_sem_take+96>
0x200141c0 <sys_work_q_stack+1640>:     0x1388  0x0     0x1388  0xe6658b00
0x200141d0 <sys_work_q_stack+1656>:     0x200106dc <i2c_stm32_dev_data_0>       0x8010d27 <stm32_i2c_transaction+346>   0x400   0x4
0x200141e0 <sys_work_q_stack+1672>:     0xa3662414      0x68    0x0     0x200142d4 <sys_work_q_stack+1916>
0x200141f0 <sys_work_q_stack+1688>:     0xe     0x7     0x10aad00       0xe6658b00
0x20014200 <sys_work_q_stack+1704>:     0x0     0x200142d0 <sys_work_q_stack+1912>      0x1     0x68
0x20014210 <sys_work_q_stack+1720>:     0x20010c92 <i2c_stm32_in_transfer>      0x200106ec <i2c_stm32_dev_data_0+16>    0x801e9e0 <__device_dts_ord_18> 0x20010c91 <i2c_stm32_trigger_mode>
0x20014220 <sys_work_q_stack+1736>:     0x20010ef5 <icm20948_state_in_isr_transfer>     0x8010f47 <i2c_stm32_transfer+166>      0x0     0x68
0x20014230 <sys_work_q_stack+1752>:     0x20014280 <sys_work_q_stack+1832>      0x801a000 <encode_header_byte>  0x2000f5b4 <power_signal_work>  0xe6658b00
0x20014240 <sys_work_q_stack+1768>:     0xbf7   0x8010ea1 <i2c_stm32_transfer>  0x20010ef4 <icm20948_state_in_transfer> 0x2001078c <icm20948_0_data>
0x20014250 <sys_work_q_stack+1784>:     0x0     0x801f748 <icm20948_0_config>   0x801eaa8 <__device_dts_ord_88> 0x0
0x20014260 <sys_work_q_stack+1800>:     0x0     0x801a7af <_icm20948_transfer+90>       0x801007f <entropy_stm32_rng_get_entropy_isr+186>       0xe6658b00
0x20014270 <sys_work_q_stack+1816>:     0x2001428c <sys_work_q_stack+1844>      0x2001079c <icm20948_0_data+16> 0x48000000      0x20010ef6 <icm20948_state_in_isr_work>
0x20014280 <sys_work_q_stack+1832>:     0x20010ef5 <icm20948_state_in_isr_transfer>     0x801a8d7 <icm20948_isr_work+126>       0x0     0x20014268 <sys_work_q_stack+1808>
0x20014290 <sys_work_q_stack+1848>:     0x2     0x4     0x200142c8 <sys_work_q_stack+1904>      0x1
0x200142a0 <sys_work_q_stack+1864>:     0x4     0x200142d0 <sys_work_q_stack+1912>      0x4     0x5
0x200142b0 <sys_work_q_stack+1880>:     0x200142cc <sys_work_q_stack+1908>      0x1     0x4     0x200142d4 <sys_work_q_stack+1916>
0x200142c0 <sys_work_q_stack+1896>:     0xe     0x7     0x800a019 <arch_swap+40>        0x6100002d
0x200142d0 <sys_work_q_stack+1912>:     0x100   0xe6658b00      0x0     0x801cdbd <z_sched_wait+28>
0x200142e0 <sys_work_q_stack+1928>:     0xffffffff      0xe6658b00      0x4     0x2001079c <icm20948_0_data+16>
0x200142f0 <sys_work_q_stack+1944>:     0x2000f360 <k_sys_work_q>       0x200107d4 <pending_cancels>    0x0     0xffffffff
0x20014300 <sys_work_q_stack+1960>:     0xffffffff      0x801d50f <work_queue_main+142> 0xffffffff      0xffffffff
0x20014310 <sys_work_q_stack+1976>:     0x0     0xaaaaaaaa      0xaaaaaaaa      0xe6658b00
0x20014320 <sys_work_q_stack+1992>:     0xaaaaaaaa      0x801d481 <work_queue_main>     0x0     0x0
0x20014330 <sys_work_q_stack+2008>:     0x0     0x0     0x0     0x0
0x20014340 <sys_work_q_stack+2024>:     0x0     0x800583f <z_thread_entry+22>   0x801d481 <work_queue_main>     0xe6658b00
0x20014350 <sys_work_q_stack+2040>:     0x0     0xaaaaaaaa
$ arm-none-eabi-nm -S ./zephyr/zephyr.elf | grep stm32_i2c_transaction
08010bcc 000001f4 T stm32_i2c_transaction
$ arm-none-eabi-addr2line -e ./zephyr/zephyr.elf 0x08010d26
/home/user/zephyrproject/zephyr/drivers/i2c/i2c_ll_stm32_v2.c:596

.../build/zephyr/zephyr.lst:

8010d04:   f7ff fda0   bl  8010848 <msg_init>
   stm32_i2c_enable_transfer_interrupts(dev);
8010d08:   6868        ldr r0, [r5, #4]
8010d0a:   f7ff fd4d   bl  80107a8 <stm32_i2c_enable_transfer_interrupts.isra.0>
 SET_BIT(I2Cx->CR1, I2C_CR1_RXIE);
8010d0e:   f8d9 3000   ldr.w   r3, [r9]
8010d12:   f043 0304   orr.w   r3, r3, #4
8010d16:   f8c9 3000   str.w   r3, [r9]
8010d1a:   f241 3288   movw    r2, #5000   @ 0x1388
8010d1e:   2300        movs    r3, #0
8010d20:   4620        mov r0, r4
8010d22:   f00c fb09   bl  801d338 <z_impl_k_sem_take>
   if (k_sem_take(&data->device_sync_sem,
8010d26:   b150        cbz r0, 8010d3e <stm32_i2c_transaction+0x172>
       stm32_i2c_master_mode_end(dev);
8010d28:   4628        mov r0, r5
8010d2a:   f7ff fd63   bl  80107f4 <stm32_i2c_master_mode_end>
8010d2e:   f04f 32ff   mov.w   r2, #4294967295 @ 0xffffffff
8010d32:   f04f 33ff   mov.w   r3, #4294967295 @ 0xffffffff
8010d36:   4620        mov r0, r4

.../zephyr/drivers/i2c/i2c_ll_stm32_v2.c: (doesn't match up with the repo, I have local changes)

static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
               uint8_t *next_msg_flags, uint16_t slave)
{
[...]
    msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);

    stm32_i2c_enable_transfer_interrupts(dev);
    LL_I2C_EnableIT_RX(i2c);

    if (k_sem_take(&data->device_sync_sem,                        # <-- line 596
               K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
        stm32_i2c_master_mode_end(dev);
        k_sem_take(&data->device_sync_sem, K_FOREVER);
        is_timeout = true;
    }

... but the semaphore isn't timing out?!

In fact, the semaphore believes nobody is waiting:

(gdb) print i2c_stm32_dev_data_0.device_sync_sem
$379 = {
  wait_q = {
    waitq = {
      {
        head = 0x200106dc <i2c_stm32_dev_data_0>,
        next = 0x200106dc <i2c_stm32_dev_data_0>
      },
      {
        tail = 0x200106dc <i2c_stm32_dev_data_0>,
        prev = 0x200106dc <i2c_stm32_dev_data_0>
      }
    }
  },
  count = 0,
  limit = 4294967295
}
attie-argentum commented 7 months ago

Reviewing the code, I'm now fairly confident that CR1 & 0x6 should only ever be one of 0x0, 0x2 or 0x4, and never 0x7 as seen here.

attie-argentum commented 7 months ago

Inserting nop instructions got this running for 8h52m before the issue presented... and it still locked up in the same way - CR1=0xf7, sys_work_q is stuck on the k_sem_take(&data->device_sync_sem, ...), but device_sync_sem has no waiters listed.

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..bcc1af1bc6 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,6 +508,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_TX(i2c);

+       __asm__ volatile ("nop");
+
        if (k_sem_take(&data->device_sync_sem,
                       K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
                stm32_i2c_master_mode_end(dev);
@@ -567,6 +569,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_RX(i2c);

+       __asm__ volatile ("nop");
+
        if (k_sem_take(&data->device_sync_sem,
                       K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
                stm32_i2c_master_mode_end(dev);
attie-argentum commented 7 months ago

The following patch will regularly hit the nop - breaking here and then continuing will generally produce a bus fault with PC outside code.

Hitting the nop is problematic, as I believe the intent is that by this point, the read / write is complete, and RXIE and TXIE interrupts are disabled... it seems TXIE gets left enabled.

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..1c7c0a285d 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -840,6 +840,13 @@ int stm32_i2c_transaction(const struct device *dev,
                } else {
                        ret = stm32_i2c_msg_read(dev, &msg, flagsp, periph);
                }
+               {
+                       const struct i2c_stm32_config *cfg = dev->config;
+                       I2C_TypeDef *i2c = cfg->i2c;
+                       if ((i2c->CR1 & 0x6) != 0) {
+                               __asm__("nop"); /* <-- break here */
+                       }
+               }
                if (ret < 0) {
                        break;
                }
attie-argentum commented 7 months ago

I have now wrapped a GPIO hi/lo around the k_sem_take(&data->device_sync_sem, ...) in each of stm32_i2c_msg_write() and stm32_i2c_msg_read():

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..2556f28809 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,8 +508,11 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_TX(i2c);

-       if (k_sem_take(&data->device_sync_sem,
-                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+       int ret = k_sem_take(&data->device_sync_sem,
+                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+       if (ret != 0) {
                stm32_i2c_master_mode_end(dev);
                k_sem_take(&data->device_sync_sem, K_FOREVER);
                is_timeout = true;
@@ -567,8 +570,11 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_RX(i2c);

-       if (k_sem_take(&data->device_sync_sem,
-                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+       int ret = k_sem_take(&data->device_sync_sem,
+                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+       if (ret != 0) {
                stm32_i2c_master_mode_end(dev);
                k_sem_take(&data->device_sync_sem, K_FOREVER);
                is_timeout = true;

"Good" Transaction

scope_20240313_031030

"Bad" Transaction

scope_20240313_030515

attie-argentum commented 7 months ago

The prolonged runtime with addition of a nop, paired with the short k_sem_take() and incorrect sequence, suggested a timing issue to me. Perhaps N+1 k_sem_give() calls in the second transaction (first read).

I have now added a k_sem_reset() just before we setup to send:

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..b16046644c 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -503,6 +503,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        data->current.is_err = 0U;
        data->current.msg = msg;

+       k_sem_reset(&data->device_sync_sem);
+
        msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_WRITE);

        stm32_i2c_enable_transfer_interrupts(dev);
@@ -562,6 +564,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        data->current.is_nack = 0U;
        data->current.msg = msg;

+       k_sem_reset(&data->device_sync_sem);
+
        msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);

        stm32_i2c_enable_transfer_interrupts(dev);

This isn't a "good" / robust fix, but the system has now been running for 10+ 13+ 17+ hours without issue... I'd like to properly understand if it is indeed an N+1, and if so where / why (it's clearly not happening on every transfer).

attie-argentum commented 7 months ago

@adamjansen - please could you clarify if you were also affected by this issue, and if this has (or at least appears to have) resolved it for you too?

attie-argentum commented 7 months ago

I've been trying to understand why this doesn't present as a problem more regularly - how come it doesn't always lockup, how come the semaphore appears to be in-sync most of the time.

It turns out, the issue is much more common than first thought, and presents differently.

We regularly find that the semaphore value is non-zero in stm32_i2c_transaction(), immediately before calling stm32_i2c_msg_write() or stm32_i2c_msg_read(). It also seems that in the majority of cases, transfer N is broken (I filled the final read buffer with 0x5a, and we see that value in the write, so the hypothesis above is correct), and N+1 is simply not performed (that buffer is still full of 0x5a after the call to i2c_transfer_dt() completes) - the lockup occurs when N is broken, and then N+1 is also performed.

In the first 5 minutes of runtime, I counted 37x instances of this (N broken, N+1 not performed, N+1 buffer untouched).

Here I am checking the semaphore with k_sem_count_get() and inserting a busy wait if it's non-zero for my trigger point (I will try to arrange for another GPIO to reduce impact on timing). I've also relocated the call to k_sem_reset().

diff --cc drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb,a685ebebeb..e764cdafb4
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@@ -827,6 -827,6 +833,17 @@@ int stm32_i2c_transaction(const struct
        int ret = 0;

        do { /* do ... while to allow zero-length transactions */
++              {
++                      struct i2c_stm32_data *data = dev->data;
++                      unsigned int c = k_sem_count_get(&data->device_sync_sem);
++
++                      if (c != 0) {
++                              k_busy_wait(275);
++                      }
++
++                      k_sem_reset(&data->device_sync_sem);
++              }
++
                if (msg.len > i2c_stm32_maxchunk) {
                        msg.len = i2c_stm32_maxchunk;
                        msg.flags &= ~I2C_MSG_STOP;

scope_20240313_213719

All changes rolled back, with only the pre-fill of 0x5a, resulted in a lockup after ~33 minutes:

scope_20240313_222830

attie-argentum commented 7 months ago

I've just confirmed that k_sem_give() is indeed called twice in the fatal situation.

Here, red / bottom is high tightly wrapped around each k_sem_give().

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..aa8f8a2393 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -121,7 +121,9 @@ static void stm32_i2c_master_mode_end(const struct device *dev)
                LL_I2C_Disable(i2c);
        }
 #endif
+GPIOA->BSRR = 1 << 8;
        k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
 }

 #if defined(CONFIG_I2C_TARGET)
@@ -411,7 +413,9 @@ static void stm32_i2c_event(const struct device *dev)
                        LL_I2C_GenerateStopCondition(i2c);
                } else {
                        stm32_i2c_disable_transfer_interrupts(dev);
+GPIOA->BSRR = 1 << 8;
                        k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
                }
        }

Normal situation - one k_sem_give() per i2c_msg transfer:

scope_20240314_021502_normal

Fatal situation - two k_sem_give() calls after the second transfer, causing the third to end early (and get corrupted), with knock-on effects. The system is now running the I2C interrupt (stm32_i2c_event()) forever.

scope_20240314_032022

adamjansen commented 7 months ago

I am also affected by this issue. I'm using an STM32G0B1RE and Zephyr v3.6.0. Rather than a write/read sequence, my use case is a single read with three parts:

I initially used a transaction comprising three (3) struct i2c_msg with one each for header, payload, and CRC. This works for a few dozen transactions and then locks up the I2C bus (SCL pulled low), and the MCU gets stuck processing interrupts.

As 3 struct i2c_msg segments (header, payload, CRC)

Successful transaction

image Notably, a total of 13 bytes are transferred instead of the expected 12 -- there's an extra 0xEE received that should not be. the time between segments is 50-60 μs.

Failed transaction

image The failed transaction has a long gap (~380 μs) after the header segment. The final byte of the segment is never acknowledged, and SCL remains low. Poking around with a debugger shows that the I2C driver is stuck waiting on a semaphore. Clearing the interrupt flags in the debugger allows execution to continue until the next occurrence. The issue does not occur when k_sem_reset() is called before each transaction segment.

As 2 struct i2c_msg segments (header, payload+CRC)

image

This is essentially identical to the 3-part transaction, including the additional byte after the CRC. The failed transaction is also similar.

As 1 struct i2c_msg segment (header+payload+CRC)

The three segments are all in the same direction (read), so they can be combined into a single struct i2c_msg containing header, payload, and CRC.

image

Note that the NAK correctly gets sent after the 12th byte, and only 12 bytes are transferred. I have not observed the issue when the read is structured as a single segment after hours of testing.

@attie-argentum's use case shows the issue can occur with transactions that involve direction change (and thus must be multiple struct i2c_msgs). Mine shows that the issue can also occur with multi-segment transactions in the same direction.

Because things seem to work when structured as a single struct i2c_msg, it seems like the issue may be related to using reload mode. The driver does not seem to use reload mode for the final transaction (withI2C_MSG_STOP flag).

stm32_i2c_event() handles the Transfer Complete and Transfer Complete Reload flags together. I'm not familiar enough with the STM32 I2C peripheral to know if there's something about how these flags are handled that could be contributing to the issue.

attie-argentum commented 7 months ago

@adamjansen - great info, thanks for sharing and confirming you're affected too.

aescolar commented 7 months ago

@attie-argentum thanks for looking into this. We are labeling it low in the Zephyr triage as it seems to affect a single driver. PRs are still welcome though.

attie-argentum commented 7 months ago

@aescolar - understood, thanks for clarifying.

I've got some ideas on how to approach this, but I'd like to discuss with others (and need to get time to spend on it)... the workaround above seems to do the trick for the moment - I've had it running for over 50 hours without issue recently.

attie-argentum commented 7 months ago

I've just started another project, and the system locks up immediately, on the same bug.

I've been through all commits that touch i2c_ll_stm32_v2.c in the last ~3 years without luck (listed below, some omitted due to merge conflicts / relevance)... unless the bug was introduced elsewhere, this is likely over 3 years old, but I'm having a hard time believing that.

List of commits, reverting these doesn't resolve the issue - 0861395713 - (4 months ago) drivers: smbus: implement SMBus driver for STM32 - Benedikt Schmidt - 65068fb4df - (4 months ago) drivers: i2c: stm32: Fix routing of secondary target address - Tim Woolliscroft - 01bdd845db - (6 months ago) drivers: i2c: stm32: disable reload mode at the end of a transfer - Mario Jaun - 3bc531e9c6 - (7 months ago) drivers: i2c: stm32: Support wakeup from STOP - Wouter Horré - 1d29a8c3c2 - (8 months ago) drivers: i2c: stm32 i2c driver supports 10-bit addressing - Francois Ramu - 1402d2d8b4 - (6 months ago) drivers: i2c_ll_stm32_v2: add warning about i2c speeds above fast - Jonas Otto - 78d3f2a6a0 - (10 months ago) stm32,i2c: Fix large I2C transactions on I2C V1 - Michael Grand - 2a97a0a95c - (10 months ago) drivers: i2c_ll_stm32_v2: Fix logic in target_unregister - Vincent Geneves - b8b1d75a54 - (10 months ago) drivers: i2c_ll_stm32_v2: Test slave_cfg pointer before using it - Vincent Geneves - 0f6c7e3d9c - (1 year, 5 months ago) drivers: i2c: Support for i2c 10-bit addressing for STM32 target - Ashwini M Shinde - d576b7f3d3 - (1 year, 10 months ago) drivers: i2c_ll_stm32_v2: add support for second target address - Aurelien Jarno - e1d22a0b61 - (3 years, 1 month ago) drivers: i2c_ll_stm32_v2: Add timeout on transfer - Julien D'Ascenzio

Here's my working fix, that has been used successfully on both projects now:

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..ed4af6b0dd 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -818,6 +818,7 @@ int stm32_i2c_transaction(const struct device *dev,
         * which will make the combination of all chunks to look like one big
         * transaction on the wire.
         */
+       struct i2c_stm32_data *data = dev->data;
        const uint32_t i2c_stm32_maxchunk = 255U;
        const uint8_t saved_flags = msg.flags;
        uint8_t combine_flags =
@@ -835,6 +836,9 @@ int stm32_i2c_transaction(const struct device *dev,
                        msg.flags = saved_flags;
                        flagsp = next_msg_flags;
                }
+
+               k_sem_reset(&data->device_sync_sem);
+
                if ((msg.flags & I2C_MSG_RW_MASK) == I2C_MSG_WRITE) {
                        ret = stm32_i2c_msg_write(dev, &msg, flagsp, periph);
                } else {

@teburd - as maintainer, please could you advise if you consider the above patch acceptable, or if deeper / proper investigation is required? Please bring others in if that's appropriate.

erian747 commented 7 months ago

@aescolar I understand that there are probably hundreds of things to be done in zephyr with high priority, but in my opinion i think this issue should be if not critical, atleast put to high prio also

Yes it is a single driver type on a limited set of SoC:s BUT it is a popular SoC family and I2C is very commonly used and required on a board

I am afraid if issues like this remains unsolved for too long, it will scare people from using Zephyr which would be sad because its a nice OS with a great community

Good investigation by the way @attie-argentum ! I might see if i can reproduce your finding and possibly find the root cause. I have worked alot with STM32 I2C in the past and know it can be a pain to get it to work without glitches

teburd commented 7 months ago

I've just started another project, and the system locks up immediately, on the same bug.

I've been through all commits that touch i2c_ll_stm32_v2.c in the last ~3 years without luck (listed below, some omitted due to merge conflicts / relevance)... unless the bug was introduced elsewhere, this is likely over 3 years old, but I'm having a hard time believing that. List of commits, reverting these doesn't resolve the issue

Here's my working fix, that has been used successfully on both projects now:

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..ed4af6b0dd 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -818,6 +818,7 @@ int stm32_i2c_transaction(const struct device *dev,
         * which will make the combination of all chunks to look like one big
         * transaction on the wire.
         */
+       struct i2c_stm32_data *data = dev->data;
        const uint32_t i2c_stm32_maxchunk = 255U;
        const uint8_t saved_flags = msg.flags;
        uint8_t combine_flags =
@@ -835,6 +836,9 @@ int stm32_i2c_transaction(const struct device *dev,
                        msg.flags = saved_flags;
                        flagsp = next_msg_flags;
                }
+
+               k_sem_reset(&data->device_sync_sem);
+
                if ((msg.flags & I2C_MSG_RW_MASK) == I2C_MSG_WRITE) {
                        ret = stm32_i2c_msg_write(dev, &msg, flagsp, periph);
                } else {

@teburd - as maintainer, please could you advise if you consider the above patch acceptable, or if deeper / proper investigation is required? Please bring others in if that's appropriate.

This is some really odd behavior, and should be fixed in a better way than resetting a semaphore before doing a transfer imo. I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual which is... confusing. I'm seeing that in #70260 in particular around the automatic reload when larger than 255 byte transfers are done. Before adding that things worked just fine and I'm tempted to ignore the hardware auto reload entirely as it seems error prone.

@erwango @FRASTM need to comment on this as this is an st specific driver, while I'm the maintainer and can accept a patch for this I'd really like to understand the st perspective on this first.

attie-argentum commented 7 months ago

should be fixed in a better way than resetting a semaphore before doing a transfer

@teburd - thanks for taking a look, I completely agree.

Sadly this patch appears to be very effective, and I'm very short on time to investigate this further right now.

I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual

Huh, frustrating... thanks for linking.

erwango commented 7 months ago

Thanks @attie-argentum for the work on this.

This V2 state machine is obviously a complex piece and we'd need some time to dig into this. However, having a quick look at the driver, I'm surprised by the way the semaphore is initialized: k_sem_init(&data->device_sync_sem, 0, K_SEM_MAX_LIMIT); With :

/**
 * @brief Maximum limit value allowed for a semaphore.
 *
 * This is intended for use when a semaphore does not have
 * an explicit maximum limit, and instead is just used for
 * counting purposes.
 *
 */
#define K_SEM_MAX_LIMIT UINT_MAX

So, IIUC, when the semaphore is taken, it never blocks but decrement the counter value. I don't think this the expected behavior. Did you have a try to change the limit value to 1 ?

Sorry if I'm completely off here, but this is something that I'd like to be sure we're doing correctly before going further.

attie-argentum commented 7 months ago

@erwango - thanks for taking a look, much appreciated.

Did you have a try to change the limit value to 1 ?

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Looking at the screenshot from above (again below for clarity), would suggest that this should have helped... I think the second (much quicker) pulse on red is a k_sem_give() with an already non-zero value, but that's largely conjecture.

Scope screenshot, from above ![scope_20240314_032022](https://github.com/zephyrproject-rtos/zephyr/assets/36298367/c1666339-741e-4a0a-be9b-62fe209abe9c)
erwango commented 7 months ago

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Ok, that was worth trying ... and I think this should be fixed anyway. Btw, I'm seeing a lof I2C drievrs using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that ?

erian747 commented 7 months ago

I can see a few potential problems with the i2c_stm32_ll_v2 driver

If i read the code correctly I2C_MSG_RESTART functionality is mixed up with the peripherals Reload feature. They are two different things, Restart is used for switching transfer direction, R to W or vice versa. Reload is a STM32 I2C V2 specific function used when transfering data chunks larger 255 bytes. If i am right in my assumption about them beeing mixed up, this is a critical flaw

Next problem is that at least 2 silicon issues mentioned in errata shets are ignored in code, most importantly the one about TXIS, where suggestion is to write 1 byte data before enabling irq

Another problem is that no data synchronization barriers are used (dsb instruction) in a few places, suppose this could be more of a problem on higher end MCUs with cache and more complex bus architectures (F7, H7 mcus)

Also a potential problem could be that an IRQ occurs during execution of irq handler, this will cause NVIC pending status to be set and irq called directly again after returning. One solution is to mask flags with their corresponding enable bits

lnhao commented 6 months ago

My machine (stm32wb55) also experienced similar issues after running for 40 hours. The program keeps triggering I2C interrupts, causing the application to be unable to be executed. The monitored I2C interrupt status register is: I2Cx ->ISR=0x8003.

Zephyr opens I2C interrupts by default. Currently, I disable I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n to run board for a period of time.

teburd commented 6 months ago

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Ok, that was worth trying ... and I think this should be fixed anyway. Btw, I'm seeing a lof I2C drievrs using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that ?

I mean its likely to cause issues, as we discussed at EOSS if there's state being passed along between the sem give and the thread waiting on it, and another interrupt occurs that again re-enters the i2c isr and updates those flags before the thread is done looking at them, a race has occurred.

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

The semaphore not having a count limit of 1 to me is saying this semaphore could be signaled any number of times. That adds to the suspicion that this is the wrong pattern to be following.

lnhao commented 5 months ago

My machine (stm32wb55) also experienced similar issues after running for 40 hours. The program keeps triggering I2C interrupts, causing the application to be unable to be executed. The monitored I2C interrupt status register is: I2Cx ->ISR=0x8003.

Zephyr opens I2C interrupts by default. Currently, I disable I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n to run board for a period of time.

After I disabled the I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n , the device has been working for more than 10 days and it is still okay.

erwango commented 5 months ago

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

Is it ? Aren't irq masked while running the isr ?

erian747 commented 5 months ago

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

Is it ? Aren't irq masked while running the isr ?

Irq:s are masked in the I2C peripheral at certain points, BUT if a second irq triggers during execution of the ISR before I2C peripheral irqs are masked, the NVIC pending status is set and ISR is called directly again after returning from ISR regardless of I2C irq enable bits Options to fix this could be:

github-actions[bot] commented 3 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

daniel-james-sherwood commented 1 month ago

I believe I'm having a similar issue. I think there's a couple of potential race conditions in stm32_i2c_event().

1 - If the interrupt is delayed then its possible for RXNE to be detected on line 467 and cleared by reading the data on line 468. However, the next byte could arrive between here and detecting TC at line 495.

2 - If the interrupt is triggered with data->current.len non-zero, but the expected TXIS or RXNE not set, line 471/2 decrement data->current.len anyway.

I'm not entirely clear on which of these (or other potential races) is the cause, but the upshot is that a subsequent interrupt occurs with data->current.len zero and RXNE set. This doesn't actually get cleared and the isr spins.

My current fix is to read TC before TXIS/RXNE which avoids the first race and only decrement data->current.len if one of TXIS/RXNE is set.

Previously my setup would spin after 30s to 5mins, after the change it has run for 3 hours without failure.

I'm not able to provide patch but hopefully this will help someone else out.