STM32 I2C v2 lockup with invalid data and read/write

zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.

https://docs.zephyrproject.org

Apache License 2.0

10.89k stars 6.64k forks source link

STM32 I2C v2 lockup with invalid data and read/write #70077

Open attie-argentum opened 8 months ago

attie-argentum commented 8 months ago

Fairly regularly (around ~1.5 hours, on a 100Hz interrupt, with other transactions alongside), I find that the system has locked up - some interrupts are still being serviced (e.g: I2C, CAN), but the application flow has entirely halted.

I'm running f51c8ee739, with an out-of-tree ICM-20948 driver using I2C address 0x68.

On investigation, I have found the following:

This always appears to happen for a specific sequence and this device,
- The sequence should be Wx1, Rx4, Wx1, Rx14, all presented as one transaction via i2c_transfer_dt()
Transaction 3 / the final write has invalid data
Transaction 4 / the final read appears on the bus as a write, with only 13x bytes
The memory for the struct i2c_msg and stacks are confirmed as in-tact (not corrupt / overwritten / overflowed)
- Both STACK_SENTINEL and STACK_CANARIES are enabled
The I2C interrupt is firing constantly
- CR1=0xf7, ISR=0x8003
- entering via stm32_i2c_event_isr()
- stm32_i2c_event() does nothing
Disabling I2C interrupts briefly allows the application to continue, even if the I2C device receives invalid writes.
- set ((I2C_TypeDef*)0x40005400)->CR1 = 0
- cont
- set ((I2C_TypeDef*)0x40005400)->CR1 = 0xf7

"Good" Transaction

"Bad" Transaction

Scrolled for decode visibility on what should be the final read, of 14x bytes (only 13x present). Green is my trigger point (SCL staying low for too long). Red is set high immediately before the call to i2c_transfer_dt(), and low immediately after - execution never returns.

In this specific instance, noticed that the final i2c_msg.buf (should be a destination buffer that gets populated) was pointing at memory containing: 0x00 (possibly seen in transaction 3), 0x9f, 0x8d, 0xb8, 0x00, 0x00, 0x00, 0x00, 0xc1, 0xcc, 0x01, 0x08, 0xff, 0xff (seen in transaction 4).

(gdb) print/x *((I2C_TypeDef*)0x40005400)
$174 = {
  CR1 = 0xf7,
  CR2 = 0xe00d0,
  OAR1 = 0x0,
  OAR2 = 0x0,
  TIMINGR = 0x20b90d1e,
  TIMEOUTR = 0x0,
  ISR = 0x8003,
  ICR = 0x0,
  PECR = 0x0,
  RXDR = 0x0,
  TXDR = 0xff
}

Expected behavior

I2C operates correctly

Impact

Frequent lockups, and heavily reduced reliability. Near showstopper.

Additional Context

I will be continuing to work on this. Until earlier today, I was suspicious of my code, but in light of some recent discoveries I plan to take a close look at the I2C driver.

attie-argentum commented 8 months ago

Further details... I'm fairly confident in the following, but any other eyes on my stack decoding would be appreciated.

To summarize the below:

We see a malformed write on the bus
The program is currently inside stm32_i2c_msg_read() (doesn't match up with write on bus)
The thread is blocked on k_sem_take(&data->device_sync_sem, ...)
The semaphore has no waiters

(gdb) print/x __stack_chk_guard
$375 = 0xe6658b00
(gdb) print/d k_sys_work_q.thread.stack_info.size
$376 = 2048
(gdb) print/a k_sys_work_q.thread.callee_saved.psp
$377 = 0x20014190 <sys_work_q_stack+1592>

# (2048 - 1592) / 4 = 114
# NOTE: highlight 0xe6658b00 if possible

(gdb) x/114wa k_sys_work_q.thread.callee_saved.psp
0x20014190 <sys_work_q_stack+1592>:     0x0     0xe000ed00      0x2000f308 <_kernel>    0x0
0x200141a0 <sys_work_q_stack+1608>:     0x200006b0 <barometer_timer>    0x801d399 <z_impl_k_sem_take+96>        0x800a020 <arch_swap+48>        0x61000000
0x200141b0 <sys_work_q_stack+1624>:     0x0     0xe6658b00      0x0     0x801d399 <z_impl_k_sem_take+96>
0x200141c0 <sys_work_q_stack+1640>:     0x1388  0x0     0x1388  0xe6658b00
0x200141d0 <sys_work_q_stack+1656>:     0x200106dc <i2c_stm32_dev_data_0>       0x8010d27 <stm32_i2c_transaction+346>   0x400   0x4
0x200141e0 <sys_work_q_stack+1672>:     0xa3662414      0x68    0x0     0x200142d4 <sys_work_q_stack+1916>
0x200141f0 <sys_work_q_stack+1688>:     0xe     0x7     0x10aad00       0xe6658b00
0x20014200 <sys_work_q_stack+1704>:     0x0     0x200142d0 <sys_work_q_stack+1912>      0x1     0x68
0x20014210 <sys_work_q_stack+1720>:     0x20010c92 <i2c_stm32_in_transfer>      0x200106ec <i2c_stm32_dev_data_0+16>    0x801e9e0 <__device_dts_ord_18> 0x20010c91 <i2c_stm32_trigger_mode>
0x20014220 <sys_work_q_stack+1736>:     0x20010ef5 <icm20948_state_in_isr_transfer>     0x8010f47 <i2c_stm32_transfer+166>      0x0     0x68
0x20014230 <sys_work_q_stack+1752>:     0x20014280 <sys_work_q_stack+1832>      0x801a000 <encode_header_byte>  0x2000f5b4 <power_signal_work>  0xe6658b00
0x20014240 <sys_work_q_stack+1768>:     0xbf7   0x8010ea1 <i2c_stm32_transfer>  0x20010ef4 <icm20948_state_in_transfer> 0x2001078c <icm20948_0_data>
0x20014250 <sys_work_q_stack+1784>:     0x0     0x801f748 <icm20948_0_config>   0x801eaa8 <__device_dts_ord_88> 0x0
0x20014260 <sys_work_q_stack+1800>:     0x0     0x801a7af <_icm20948_transfer+90>       0x801007f <entropy_stm32_rng_get_entropy_isr+186>       0xe6658b00
0x20014270 <sys_work_q_stack+1816>:     0x2001428c <sys_work_q_stack+1844>      0x2001079c <icm20948_0_data+16> 0x48000000      0x20010ef6 <icm20948_state_in_isr_work>
0x20014280 <sys_work_q_stack+1832>:     0x20010ef5 <icm20948_state_in_isr_transfer>     0x801a8d7 <icm20948_isr_work+126>       0x0     0x20014268 <sys_work_q_stack+1808>
0x20014290 <sys_work_q_stack+1848>:     0x2     0x4     0x200142c8 <sys_work_q_stack+1904>      0x1
0x200142a0 <sys_work_q_stack+1864>:     0x4     0x200142d0 <sys_work_q_stack+1912>      0x4     0x5
0x200142b0 <sys_work_q_stack+1880>:     0x200142cc <sys_work_q_stack+1908>      0x1     0x4     0x200142d4 <sys_work_q_stack+1916>
0x200142c0 <sys_work_q_stack+1896>:     0xe     0x7     0x800a019 <arch_swap+40>        0x6100002d
0x200142d0 <sys_work_q_stack+1912>:     0x100   0xe6658b00      0x0     0x801cdbd <z_sched_wait+28>
0x200142e0 <sys_work_q_stack+1928>:     0xffffffff      0xe6658b00      0x4     0x2001079c <icm20948_0_data+16>
0x200142f0 <sys_work_q_stack+1944>:     0x2000f360 <k_sys_work_q>       0x200107d4 <pending_cancels>    0x0     0xffffffff
0x20014300 <sys_work_q_stack+1960>:     0xffffffff      0x801d50f <work_queue_main+142> 0xffffffff      0xffffffff
0x20014310 <sys_work_q_stack+1976>:     0x0     0xaaaaaaaa      0xaaaaaaaa      0xe6658b00
0x20014320 <sys_work_q_stack+1992>:     0xaaaaaaaa      0x801d481 <work_queue_main>     0x0     0x0
0x20014330 <sys_work_q_stack+2008>:     0x0     0x0     0x0     0x0
0x20014340 <sys_work_q_stack+2024>:     0x0     0x800583f <z_thread_entry+22>   0x801d481 <work_queue_main>     0xe6658b00
0x20014350 <sys_work_q_stack+2040>:     0x0     0xaaaaaaaa

$ arm-none-eabi-nm -S ./zephyr/zephyr.elf | grep stm32_i2c_transaction
08010bcc 000001f4 T stm32_i2c_transaction

stm32_i2c_transaction+346 --> 0x08010bcc + 346 = 0x08010d26

$ arm-none-eabi-addr2line -e ./zephyr/zephyr.elf 0x08010d26
/home/user/zephyrproject/zephyr/drivers/i2c/i2c_ll_stm32_v2.c:596

.../build/zephyr/zephyr.lst:

8010d04:   f7ff fda0   bl  8010848 <msg_init>
   stm32_i2c_enable_transfer_interrupts(dev);
8010d08:   6868        ldr r0, [r5, #4]
8010d0a:   f7ff fd4d   bl  80107a8 <stm32_i2c_enable_transfer_interrupts.isra.0>
 SET_BIT(I2Cx->CR1, I2C_CR1_RXIE);
8010d0e:   f8d9 3000   ldr.w   r3, [r9]
8010d12:   f043 0304   orr.w   r3, r3, #4
8010d16:   f8c9 3000   str.w   r3, [r9]
8010d1a:   f241 3288   movw    r2, #5000   @ 0x1388
8010d1e:   2300        movs    r3, #0
8010d20:   4620        mov r0, r4
8010d22:   f00c fb09   bl  801d338 <z_impl_k_sem_take>
   if (k_sem_take(&data->device_sync_sem,
8010d26:   b150        cbz r0, 8010d3e <stm32_i2c_transaction+0x172>
       stm32_i2c_master_mode_end(dev);
8010d28:   4628        mov r0, r5
8010d2a:   f7ff fd63   bl  80107f4 <stm32_i2c_master_mode_end>
8010d2e:   f04f 32ff   mov.w   r2, #4294967295 @ 0xffffffff
8010d32:   f04f 33ff   mov.w   r3, #4294967295 @ 0xffffffff
8010d36:   4620        mov r0, r4

.../zephyr/drivers/i2c/i2c_ll_stm32_v2.c: (doesn't match up with the repo, I have local changes)

static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
               uint8_t *next_msg_flags, uint16_t slave)
{
[...]
    msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);

    stm32_i2c_enable_transfer_interrupts(dev);
    LL_I2C_EnableIT_RX(i2c);

    if (k_sem_take(&data->device_sync_sem,                        # <-- line 596
               K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
        stm32_i2c_master_mode_end(dev);
        k_sem_take(&data->device_sync_sem, K_FOREVER);
        is_timeout = true;
    }

... but the semaphore isn't timing out?!

In fact, the semaphore believes nobody is waiting:

(gdb) print i2c_stm32_dev_data_0.device_sync_sem
$379 = {
  wait_q = {
    waitq = {
      {
        head = 0x200106dc <i2c_stm32_dev_data_0>,
        next = 0x200106dc <i2c_stm32_dev_data_0>
      },
      {
        tail = 0x200106dc <i2c_stm32_dev_data_0>,
        prev = 0x200106dc <i2c_stm32_dev_data_0>
      }
    }
  },
  count = 0,
  limit = 4294967295
}

attie-argentum commented 8 months ago

Reviewing the code, I'm now fairly confident that CR1 & 0x6 should only ever be one of 0x0, 0x2 or 0x4, and never 0x7 as seen here.

attie-argentum commented 8 months ago

Inserting nop instructions got this running for 8h52m before the issue presented... and it still locked up in the same way - CR1=0xf7, sys_work_q is stuck on the k_sem_take(&data->device_sync_sem, ...), but device_sync_sem has no waiters listed.

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..bcc1af1bc6 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,6 +508,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_TX(i2c);

+       __asm__ volatile ("nop");
+
        if (k_sem_take(&data->device_sync_sem,
                       K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
                stm32_i2c_master_mode_end(dev);
@@ -567,6 +569,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_RX(i2c);

+       __asm__ volatile ("nop");
+
        if (k_sem_take(&data->device_sync_sem,
                       K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
                stm32_i2c_master_mode_end(dev);

attie-argentum commented 8 months ago

The following patch will regularly hit the nop - breaking here and then continuing will generally produce a bus fault with PC outside code.

Hitting the nop is problematic, as I believe the intent is that by this point, the read / write is complete, and RXIE and TXIE interrupts are disabled... it seems TXIE gets left enabled.

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..1c7c0a285d 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -840,6 +840,13 @@ int stm32_i2c_transaction(const struct device *dev,
                } else {
                        ret = stm32_i2c_msg_read(dev, &msg, flagsp, periph);
                }
+               {
+                       const struct i2c_stm32_config *cfg = dev->config;
+                       I2C_TypeDef *i2c = cfg->i2c;
+                       if ((i2c->CR1 & 0x6) != 0) {
+                               __asm__("nop"); /* <-- break here */
+                       }
+               }
                if (ret < 0) {
                        break;
                }

attie-argentum commented 8 months ago

I have now wrapped a GPIO hi/lo around the k_sem_take(&data->device_sync_sem, ...) in each of stm32_i2c_msg_write() and stm32_i2c_msg_read():

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..2556f28809 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -508,8 +508,11 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_TX(i2c);

-       if (k_sem_take(&data->device_sync_sem,
-                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+       int ret = k_sem_take(&data->device_sync_sem,
+                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+       if (ret != 0) {
                stm32_i2c_master_mode_end(dev);
                k_sem_take(&data->device_sync_sem, K_FOREVER);
                is_timeout = true;
@@ -567,8 +570,11 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        stm32_i2c_enable_transfer_interrupts(dev);
        LL_I2C_EnableIT_RX(i2c);

-       if (k_sem_take(&data->device_sync_sem,
-                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC)) != 0) {
+GPIOA->BSRR = 1 << 8;
+       int ret = k_sem_take(&data->device_sync_sem,
+                      K_MSEC(STM32_I2C_TRANSFER_TIMEOUT_MSEC));
+GPIOA->BSRR = 1 << (8 + 16);
+       if (ret != 0) {
                stm32_i2c_master_mode_end(dev);
                k_sem_take(&data->device_sync_sem, K_FOREVER);
                is_timeout = true;

"Good" Transaction

The third operation (write 0x2d) is entirely contained by a high period on the red (bottom) trace.
The data is correct (0x2d)
The data is close to the address

scope_20240313_031030

"Bad" Transaction

The third operation is not contained by the high period on red.
The data is incorrect (0x00, should always be 0x2d)
The data is delayed to just before the following address

scope_20240313_030515

attie-argentum commented 8 months ago

The prolonged runtime with addition of a nop, paired with the short k_sem_take() and incorrect sequence, suggested a timing issue to me. Perhaps N+1 k_sem_give() calls in the second transaction (first read).

I have now added a k_sem_reset() just before we setup to send:

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..b16046644c 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -503,6 +503,8 @@ static int stm32_i2c_msg_write(const struct device *dev, struct i2c_msg *msg,
        data->current.is_err = 0U;
        data->current.msg = msg;

+       k_sem_reset(&data->device_sync_sem);
+
        msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_WRITE);

        stm32_i2c_enable_transfer_interrupts(dev);
@@ -562,6 +564,8 @@ static int stm32_i2c_msg_read(const struct device *dev, struct i2c_msg *msg,
        data->current.is_nack = 0U;
        data->current.msg = msg;

+       k_sem_reset(&data->device_sync_sem);
+
        msg_init(dev, msg, next_msg_flags, slave, LL_I2C_REQUEST_READ);

        stm32_i2c_enable_transfer_interrupts(dev);

This isn't a "good" / robust fix, but the system has now been running for ~~10+~~ ~~13+~~ 17+ hours without issue... I'd like to properly understand if it is indeed an N+1, and if so where / why (it's clearly not happening on every transfer).

attie-argentum commented 8 months ago

@adamjansen - please could you clarify if you were also affected by this issue, and if this has (or at least appears to have) resolved it for you too?

attie-argentum commented 8 months ago

I've been trying to understand why this doesn't present as a problem more regularly - how come it doesn't always lockup, how come the semaphore appears to be in-sync most of the time.

It turns out, the issue is much more common than first thought, and presents differently.

We regularly find that the semaphore value is non-zero in stm32_i2c_transaction(), immediately before calling stm32_i2c_msg_write() or stm32_i2c_msg_read(). It also seems that in the majority of cases, transfer N is broken (I filled the final read buffer with 0x5a, and we see that value in the write, so the hypothesis above is correct), and N+1 is simply not performed (that buffer is still full of 0x5a after the call to i2c_transfer_dt() completes) - the lockup occurs when N is broken, and then N+1 is also performed.

In the first 5 minutes of runtime, I counted 37x instances of this (N broken, N+1 not performed, N+1 buffer untouched).

Here I am checking the semaphore with k_sem_count_get() and inserting a busy wait if it's non-zero for my trigger point (I will try to arrange for another GPIO to reduce impact on timing). I've also relocated the call to k_sem_reset().

diff --cc drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb,a685ebebeb..e764cdafb4
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@@ -827,6 -827,6 +833,17 @@@ int stm32_i2c_transaction(const struct
        int ret = 0;

        do { /* do ... while to allow zero-length transactions */
++              {
++                      struct i2c_stm32_data *data = dev->data;
++                      unsigned int c = k_sem_count_get(&data->device_sync_sem);
++
++                      if (c != 0) {
++                              k_busy_wait(275);
++                      }
++
++                      k_sem_reset(&data->device_sync_sem);
++              }
++
                if (msg.len > i2c_stm32_maxchunk) {
                        msg.len = i2c_stm32_maxchunk;
                        msg.flags &= ~I2C_MSG_STOP;

scope_20240313_213719

All changes rolled back, with only the pre-fill of 0x5a, resulted in a lockup after ~33 minutes:

scope_20240313_222830

attie-argentum commented 8 months ago

I've just confirmed that k_sem_give() is indeed called twice in the fatal situation.

Here, red / bottom is high tightly wrapped around each k_sem_give().

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..aa8f8a2393 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -121,7 +121,9 @@ static void stm32_i2c_master_mode_end(const struct device *dev)
                LL_I2C_Disable(i2c);
        }
 #endif
+GPIOA->BSRR = 1 << 8;
        k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
 }

 #if defined(CONFIG_I2C_TARGET)
@@ -411,7 +413,9 @@ static void stm32_i2c_event(const struct device *dev)
                        LL_I2C_GenerateStopCondition(i2c);
                } else {
                        stm32_i2c_disable_transfer_interrupts(dev);
+GPIOA->BSRR = 1 << 8;
                        k_sem_give(&data->device_sync_sem);
+GPIOA->BSRR = 1 << (8 + 16);
                }
        }

Normal situation - one k_sem_give() per i2c_msg transfer:

scope_20240314_021502_normal

Fatal situation - two k_sem_give() calls after the second transfer, causing the third to end early (and get corrupted), with knock-on effects. The system is now running the I2C interrupt (stm32_i2c_event()) forever.

scope_20240314_032022

adamjansen commented 8 months ago

I am also affected by this issue. I'm using an STM32G0B1RE and Zephyr v3.6.0. Rather than a write/read sequence, my use case is a single read with three parts:

6 byte header
4 bytes of payload data
2 byte CRC

I initially used a transaction comprising three (3) struct i2c_msg with one each for header, payload, and CRC. This works for a few dozen transactions and then locks up the I2C bus (SCL pulled low), and the MCU gets stuck processing interrupts.

As 3 `struct i2c_msg` segments (header, payload, CRC)

Successful transaction

Notably, a total of 13 bytes are transferred instead of the expected 12 -- there's an extra 0xEE received that should not be. the time between segments is 50-60 μs.

Failed transaction

The failed transaction has a long gap (~380 μs) after the header segment. The final byte of the segment is never acknowledged, and SCL remains low. Poking around with a debugger shows that the I2C driver is stuck waiting on a semaphore. Clearing the interrupt flags in the debugger allows execution to continue until the next occurrence. The issue does not occur when k_sem_reset() is called before each transaction segment.

As 2 `struct i2c_msg` segments (header, payload+CRC)

This is essentially identical to the 3-part transaction, including the additional byte after the CRC. The failed transaction is also similar.

As 1 `struct i2c_msg` segment (header+payload+CRC)

The three segments are all in the same direction (read), so they can be combined into a single struct i2c_msg containing header, payload, and CRC.

Note that the NAK correctly gets sent after the 12th byte, and only 12 bytes are transferred. I have not observed the issue when the read is structured as a single segment after hours of testing.

@attie-argentum's use case shows the issue can occur with transactions that involve direction change (and thus must be multiple struct i2c_msgs). Mine shows that the issue can also occur with multi-segment transactions in the same direction.

Because things seem to work when structured as a single struct i2c_msg, it seems like the issue may be related to using reload mode. The driver does not seem to use reload mode for the final transaction (withI2C_MSG_STOP flag).

stm32_i2c_event() handles the Transfer Complete and Transfer Complete Reload flags together. I'm not familiar enough with the STM32 I2C peripheral to know if there's something about how these flags are handled that could be contributing to the issue.

attie-argentum commented 8 months ago

@adamjansen - great info, thanks for sharing and confirming you're affected too.

aescolar commented 8 months ago

@attie-argentum thanks for looking into this. We are labeling it low in the Zephyr triage as it seems to affect a single driver. PRs are still welcome though.

attie-argentum commented 8 months ago

@aescolar - understood, thanks for clarifying.

I've got some ideas on how to approach this, but I'd like to discuss with others (and need to get time to spend on it)... the workaround above seems to do the trick for the moment - I've had it running for over 50 hours without issue recently.

attie-argentum commented 7 months ago

I've just started another project, and the system locks up immediately, on the same bug.

I've been through all commits that touch i2c_ll_stm32_v2.c in the last ~3 years without luck (listed below, some omitted due to merge conflicts / relevance)... unless the bug was introduced elsewhere, this is likely over 3 years old, but I'm having a hard time believing that.

List of commits, reverting these doesn't resolve the issue

- 0861395713 - (4 months ago) drivers: smbus: implement SMBus driver for STM32 - Benedikt Schmidt - 65068fb4df - (4 months ago) drivers: i2c: stm32: Fix routing of secondary target address - Tim Woolliscroft - 01bdd845db - (6 months ago) drivers: i2c: stm32: disable reload mode at the end of a transfer - Mario Jaun - 3bc531e9c6 - (7 months ago) drivers: i2c: stm32: Support wakeup from STOP - Wouter Horré - 1d29a8c3c2 - (8 months ago) drivers: i2c: stm32 i2c driver supports 10-bit addressing - Francois Ramu - 1402d2d8b4 - (6 months ago) drivers: i2c_ll_stm32_v2: add warning about i2c speeds above fast - Jonas Otto - 78d3f2a6a0 - (10 months ago) stm32,i2c: Fix large I2C transactions on I2C V1 - Michael Grand - 2a97a0a95c - (10 months ago) drivers: i2c_ll_stm32_v2: Fix logic in target_unregister - Vincent Geneves - b8b1d75a54 - (10 months ago) drivers: i2c_ll_stm32_v2: Test slave_cfg pointer before using it - Vincent Geneves - 0f6c7e3d9c - (1 year, 5 months ago) drivers: i2c: Support for i2c 10-bit addressing for STM32 target - Ashwini M Shinde - d576b7f3d3 - (1 year, 10 months ago) drivers: i2c_ll_stm32_v2: add support for second target address - Aurelien Jarno - e1d22a0b61 - (3 years, 1 month ago) drivers: i2c_ll_stm32_v2: Add timeout on transfer - Julien D'Ascenzio

Here's my working fix, that has been used successfully on both projects now:

diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..ed4af6b0dd 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -818,6 +818,7 @@ int stm32_i2c_transaction(const struct device *dev,
         * which will make the combination of all chunks to look like one big
         * transaction on the wire.
         */
+       struct i2c_stm32_data *data = dev->data;
        const uint32_t i2c_stm32_maxchunk = 255U;
        const uint8_t saved_flags = msg.flags;
        uint8_t combine_flags =
@@ -835,6 +836,9 @@ int stm32_i2c_transaction(const struct device *dev,
                        msg.flags = saved_flags;
                        flagsp = next_msg_flags;
                }
+
+               k_sem_reset(&data->device_sync_sem);
+
                if ((msg.flags & I2C_MSG_RW_MASK) == I2C_MSG_WRITE) {
                        ret = stm32_i2c_msg_write(dev, &msg, flagsp, periph);
                } else {

@teburd - as maintainer, please could you advise if you consider the above patch acceptable, or if deeper / proper investigation is required? Please bring others in if that's appropriate.

erian747 commented 7 months ago

@aescolar I understand that there are probably hundreds of things to be done in zephyr with high priority, but in my opinion i think this issue should be if not critical, atleast put to high prio also

Yes it is a single driver type on a limited set of SoC:s BUT it is a popular SoC family and I2C is very commonly used and required on a board

I am afraid if issues like this remains unsolved for too long, it will scare people from using Zephyr which would be sad because its a nice OS with a great community

Good investigation by the way @attie-argentum ! I might see if i can reproduce your finding and possibly find the root cause. I have worked alot with STM32 I2C in the past and know it can be a pain to get it to work without glitches

teburd commented 7 months ago

I've just started another project, and the system locks up immediately, on the same bug.

I've been through all commits that touch i2c_ll_stm32_v2.c in the last ~3 years without luck (listed below, some omitted due to merge conflicts / relevance)... unless the bug was introduced elsewhere, this is likely over 3 years old, but I'm having a hard time believing that. List of commits, reverting these doesn't resolve the issue

Here's my working fix, that has been used successfully on both projects now:
diff --git drivers/i2c/i2c_ll_stm32_v2.c drivers/i2c/i2c_ll_stm32_v2.c
index a685ebebeb..ed4af6b0dd 100644
--- drivers/i2c/i2c_ll_stm32_v2.c
+++ drivers/i2c/i2c_ll_stm32_v2.c
@@ -818,6 +818,7 @@ int stm32_i2c_transaction(const struct device *dev,
         * which will make the combination of all chunks to look like one big
         * transaction on the wire.
         */
+       struct i2c_stm32_data *data = dev->data;
        const uint32_t i2c_stm32_maxchunk = 255U;
        const uint8_t saved_flags = msg.flags;
        uint8_t combine_flags =
@@ -835,6 +836,9 @@ int stm32_i2c_transaction(const struct device *dev,
                        msg.flags = saved_flags;
                        flagsp = next_msg_flags;
                }
+
+               k_sem_reset(&data->device_sync_sem);
+
                if ((msg.flags & I2C_MSG_RW_MASK) == I2C_MSG_WRITE) {
                        ret = stm32_i2c_msg_write(dev, &msg, flagsp, periph);
                } else {
@teburd - as maintainer, please could you advise if you consider the above patch acceptable, or if deeper / proper investigation is required? Please bring others in if that's appropriate.

This is some really odd behavior, and should be fixed in a better way than resetting a semaphore before doing a transfer imo. I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual which is... confusing. I'm seeing that in #70260 in particular around the automatic reload when larger than 255 byte transfers are done. Before adding that things worked just fine and I'm tempted to ignore the hardware auto reload entirely as it seems error prone.

@erwango @FRASTM need to comment on this as this is an st specific driver, while I'm the maintainer and can accept a patch for this I'd really like to understand the st perspective on this first.

attie-argentum commented 7 months ago

should be fixed in a better way than resetting a semaphore before doing a transfer

@teburd - thanks for taking a look, I completely agree.

Sadly this patch appears to be very effective, and I'm very short on time to investigate this further right now.

I'm actually seeing other odd behavior with the v2 hardware block not matching the state machine diagrams in the user manual

Huh, frustrating... thanks for linking.

erwango commented 7 months ago

Thanks @attie-argentum for the work on this.

This V2 state machine is obviously a complex piece and we'd need some time to dig into this. However, having a quick look at the driver, I'm surprised by the way the semaphore is initialized: k_sem_init(&data->device_sync_sem, 0, K_SEM_MAX_LIMIT); With :

/**
 * @brief Maximum limit value allowed for a semaphore.
 *
 * This is intended for use when a semaphore does not have
 * an explicit maximum limit, and instead is just used for
 * counting purposes.
 *
 */
#define K_SEM_MAX_LIMIT UINT_MAX

So, IIUC, when the semaphore is taken, it never blocks but decrement the counter value. I don't think this the expected behavior. Did you have a try to change the limit value to 1 ?

Sorry if I'm completely off here, but this is something that I'd like to be sure we're doing correctly before going further.

attie-argentum commented 7 months ago

@erwango - thanks for taking a look, much appreciated.

Did you have a try to change the limit value to 1 ?

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Looking at the screenshot from above (again below for clarity), would suggest that this should have helped... I think the second (much quicker) pulse on red is a k_sem_give() with an already non-zero value, but that's largely conjecture.

Scope screenshot, from above

![scope_20240314_032022](https://github.com/zephyrproject-rtos/zephyr/assets/36298367/c1666339-741e-4a0a-be9b-62fe209abe9c)

erwango commented 7 months ago

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Ok, that was worth trying ... and I think this should be fixed anyway. Btw, I'm seeing a lof I2C drievrs using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that ?

erian747 commented 7 months ago

I can see a few potential problems with the i2c_stm32_ll_v2 driver

If i read the code correctly I2C_MSG_RESTART functionality is mixed up with the peripherals Reload feature. They are two different things, Restart is used for switching transfer direction, R to W or vice versa. Reload is a STM32 I2C V2 specific function used when transfering data chunks larger 255 bytes. If i am right in my assumption about them beeing mixed up, this is a critical flaw

Next problem is that at least 2 silicon issues mentioned in errata shets are ignored in code, most importantly the one about TXIS, where suggestion is to write 1 byte data before enabling irq

Another problem is that no data synchronization barriers are used (dsb instruction) in a few places, suppose this could be more of a problem on higher end MCUs with cache and more complex bus architectures (F7, H7 mcus)

Also a potential problem could be that an IRQ occurs during execution of irq handler, this will cause NVIC pending status to be set and irq called directly again after returning. One solution is to mask flags with their corresponding enable bits

lnhao commented 6 months ago

My machine (stm32wb55) also experienced similar issues after running for 40 hours. The program keeps triggering I2C interrupts, causing the application to be unable to be executed. The monitored I2C interrupt status register is: I2Cx ->ISR=0x8003.

Zephyr opens I2C interrupts by default. Currently, I disable I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n to run board for a period of time.

teburd commented 6 months ago

I must admit, I missed that - good spot. Unfortuantely, even with limit=1, the same issue occurs, and my patch is required for normal operation.

Ok, that was worth trying ... and I think this should be fixed anyway. Btw, I'm seeing a lof I2C drievrs using this same value, likely a copy/paste effect, unless I'm missing something. @teburd any opinion on that ?

I mean its likely to cause issues, as we discussed at EOSS if there's state being passed along between the sem give and the thread waiting on it, and another interrupt occurs that again re-enters the i2c isr and updates those flags before the thread is done looking at them, a race has occurred.

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

The semaphore not having a count limit of 1 to me is saying this semaphore could be signaled any number of times. That adds to the suspicion that this is the wrong pattern to be following.

lnhao commented 6 months ago

My machine (stm32wb55) also experienced similar issues after running for 40 hours. The program keeps triggering I2C interrupts, causing the application to be unable to be executed. The monitored I2C interrupt status register is: I2Cx ->ISR=0x8003.

Zephyr opens I2C interrupts by default. Currently, I disable I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n to run board for a period of time.

After I disabled the I2C interrupt CONFIG_I2C_STM32_INTERRUPT=n , the device has been working for more than 10 days and it is still okay.

erwango commented 6 months ago

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

Is it ? Aren't irq masked while running the isr ?

erian747 commented 5 months ago

So I think this general pattern of catching the ISR, stashing state, signaling via a semaphore and going on is a bit flawed in that respect if its possible another interrupt comes in and updates those same flags potentially.

Is it ? Aren't irq masked while running the isr ?

Irq:s are masked in the I2C peripheral at certain points, BUT if a second irq triggers during execution of the ISR before I2C peripheral irqs are masked, the NVIC pending status is set and ISR is called directly again after returning from ISR regardless of I2C irq enable bits Options to fix this could be:

have 100% control of sw flow matching peripheral hw state
control NVIC mask and pend irq bits
ignore irqs which are masked by I2C peripheral in ISR

github-actions[bot] commented 3 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

daniel-james-sherwood commented 1 month ago

I believe I'm having a similar issue. I think there's a couple of potential race conditions in stm32_i2c_event().

1 - If the interrupt is delayed then its possible for RXNE to be detected on line 467 and cleared by reading the data on line 468. However, the next byte could arrive between here and detecting TC at line 495.

2 - If the interrupt is triggered with data->current.len non-zero, but the expected TXIS or RXNE not set, line 471/2 decrement data->current.len anyway.

I'm not entirely clear on which of these (or other potential races) is the cause, but the upshot is that a subsequent interrupt occurs with data->current.len zero and RXNE set. This doesn't actually get cleared and the isr spins.

My current fix is to read TC before TXIS/RXNE which avoids the first race and only decrement data->current.len if one of TXIS/RXNE is set.

Previously my setup would spin after 30s to 5mins, after the change it has run for 3 hours without failure.

I'm not able to provide patch but hopefully this will help someone else out.

erwango commented 1 week ago

My current fix is to read TC before TXIS/RXNE which avoids the first race and only decrement data->current.len if one of TXIS/RXNE is set.

Thanks @daniel-james-sherwood. @attie-argentum, @erian747 did you had a chance to try this fix ?

erian747 commented 1 week ago

I rewrote the msg_init and ISR and using direct register access and some other modifications like reading I2C->ISR into a local variable before doing anything with control registers, see below (should probably used sys_read8 and sys_write8)

Works fine tested on a few I2C devices, but i guess if im going to do a PR of it in future i think the whole stm32 i2c code should use direct register access instead of HAL to make it consistent I don't know if you agree with me that tossing out HAL in some drivers is a good idea :) Another topic maybe, but would be nice to have DMA too...

static inline void msg_init(const struct device *dev, struct i2c_msg *msg,
                uint8_t *next_msg_flags, uint16_t slave,
                uint32_t transfer)
{
    const struct i2c_stm32_config *cfg = dev->config;
    struct i2c_stm32_data *data = dev->data;
    I2C_TypeDef *i2c = cfg->i2c;

    uint32_t cr2 = i2c->CR2;
    uint32_t isr = i2c->ISR;

    cr2 &= ~(I2C_CR2_RELOAD | I2C_CR2_AUTOEND | I2C_CR2_NBYTES_Msk | I2C_CR2_SADD_Msk);

    if (I2C_ADDR_10_BITS & data->dev_config) {
        cr2 |= (uint32_t)slave | I2C_CR2_ADD10;
    } else {
        cr2 |= (uint32_t)slave << 1;
    }

    i2c->CR1 |= I2C_CR1_PE;

    /*
     * If this is not a stop message or
     * if more messages follow without change of direction or
     * size is larger then 255 then reload mode must be used
     */
    if ((!(msg->flags & I2C_MSG_STOP) && next_msg_flags &&
            !(*next_msg_flags & I2C_MSG_RESTART)) || (msg->len > 255)) {
        cr2 |= I2C_CR2_RELOAD;
    }

    /* Limit transfer size to 255 bytes */
    cr2 |= MIN(255ul, msg->len) << I2C_CR2_NBYTES_Pos;
    i2c->CR2 = cr2;

    /* A start condition shall be sent if
     * - msg->flags contains I2C_MSG_RESTART or
     * - TC and TCR in ISR register is cleared
     */
    if ((!(isr & (I2C_ISR_TCR | I2C_ISR_TC)) || (msg->flags & I2C_MSG_RESTART))) {

        /* First message or restart */
        if (transfer == LL_I2C_REQUEST_WRITE) {
            cr2 &= ~I2C_CR2_RD_WRN;
            i2c->CR2 = cr2;
            /* Prepare first byte in TX buffer before transfer start */
            if (data->current.len) {
                i2c->TXDR = *data->current.buf;
                data->current.len--;
                data->current.buf++;
            }
        } else {
            cr2 |= I2C_CR2_RD_WRN;
            i2c->CR2 = cr2;
        }
        /* Issue start */
        i2c->CR2 = cr2 | I2C_CR2_START;
    }

#if defined(CONFIG_I2C_TARGET)
    data->master_active = true;
#endif
}


static void stm32_i2c_event(const struct device *dev)
{
    const struct i2c_stm32_config *cfg = dev->config;
    struct i2c_stm32_data *data = dev->data;
    I2C_TypeDef *i2c = cfg->i2c;
    uint32_t isr = i2c->ISR;
#if defined(CONFIG_I2C_TARGET)
    if (data->slave_attached && !data->master_active) {
        stm32_i2c_slave_event(dev);
        return;
    }
#endif
    /* NACK received */
    if (isr & I2C_ISR_NACKF) {
        i2c->ICR = I2C_ICR_NACKCF;
        data->current.is_nack = 1U;
        /*
         * AutoEndMode is always disabled in master mode,
         * so send a stop condition manually
         */
        i2c->CR1 &= ~(I2C_CR1_TCIE | I2C_CR1_RXIE | I2C_CR1_TXIE);
        i2c->CR2 |= I2C_CR2_STOP;
        /* Flush TX */
        i2c->ISR = I2C_ISR_TXE;
    } else if (isr & I2C_ISR_RXNE) {
        uint8_t rx = i2c->RXDR;
        if (data->current.len) {
            *data->current.buf = rx;
            data->current.len--;
            data->current.buf++;
        }
    } else if (isr & I2C_ISR_TXIS) {
        /* Send next byte */
        i2c->TXDR = *data->current.buf;
        if(data->current.len) {
            data->current.len--;
            data->current.buf++;
        }
    /* STOP received */
    } else if (isr & I2C_ISR_STOPF) {
        i2c->ICR = I2C_ICR_STOPCF;
        i2c->CR1 &= ~(I2C_CR1_ERRIE | I2C_CR1_STOPIE | I2C_CR1_NACKIE | I2C_CR1_TCIE | I2C_CR1_RXIE | I2C_CR1_TXIE);
        i2c->CR1 &= ~(I2C_CR1_PE);
        k_sem_give(&data->device_sync_sem);

    /* Transfer Complete Reload mode active, prepare size of next chunk*/
    } else if (isr & (I2C_ISR_TCR | I2C_ISR_TC)) {
        uint32_t cr2 = i2c->CR2;

        /* If more than 255 bytes remains*/
        if (data->current.len > 255) {
            /* keep RELOAD mode and set NBYTES to 255 again */
            i2c->CR2 = cr2;
        } else {
            cr2 &= ~(I2C_CR2_RELOAD | I2C_CR2_NBYTES_Msk);
            if (data->current.len && (isr & I2C_ISR_TCR)) {
                /* Set length of last data chunk */
                i2c->CR2 = cr2 | (data->current.len << I2C_CR2_NBYTES_Pos);
            } else {
                i2c->CR1 &= ~(I2C_CR1_TCIE | I2C_CR1_RXIE | I2C_CR1_TXIE);
                /* Send stop if flag set in message */
                if (data->current.msg->flags & I2C_MSG_STOP) {
                    i2c->CR2 = cr2 | I2C_CR2_STOP;
                } else {
                    /* Handover to thread */
                    k_sem_give(&data->device_sync_sem);
                }
            }
        }
    }
}

erwango commented 1 week ago

Thanks @erian747 for the update.

About

Works fine tested on a few I2C devices, but i guess if im going to do a PR of it in future i think the whole stm32 i2c code should use direct register access instead of HAL to make it consistent I don't know if you agree with me that tossing out HAL in some drivers is a good idea :)

LL API is only a basic abstraction register accesses (no variables, state machines or algorithm involved). I would be curious to understand what pushed you to get rid of it. Did you find limitations of actual bugs due to its usage ?

Anyway, focusing on your change: to ease the review (and the final merge), it would be easier to do the minimum required amount of changes in a first step, specifically by keeping LL usage when it doesn't have functional impact. This fix is for sure welcome but requires careful review and having too much unrelated changes will de-focus the review.

Regarding the fact of fully getting rid of LL API, I won't block if there is a good reason, but I'm generally not in favor. It currently allows to smoothly integrate new series thanks to its abstraction, which won't be guaranteed with direct registers access.

erian747 commented 1 week ago

IMO most importantly is to get better control of interface sw->hw For example the function below in original I2C code, here interrupts are masked one by one because it is the only way offered by LL_I2C and in worst case an IRQ is triggered in the middle of the sequence and calls the ISR which probably is not expected behavior (note, haven't check if there are any real problems caused by this, just used it as an example)

static void stm32_i2c_disable_transfer_interrupts(const struct device *dev)
{
    const struct i2c_stm32_config *cfg = dev->config;
    struct i2c_stm32_data *data = dev->data;
    I2C_TypeDef *i2c = cfg->i2c;

    LL_I2C_DisableIT_TX(i2c);
    LL_I2C_DisableIT_RX(i2c);
    LL_I2C_DisableIT_STOP(i2c);
    LL_I2C_DisableIT_NACK(i2c);
    LL_I2C_DisableIT_TC(i2c);

    if (!data->smbalert_active) {
        LL_I2C_DisableIT_ERR(i2c);
    }
}

With direct register access, all interrupts can be masked with only one read and write

i2c->CR1 &= ~(I2C_CR1_RXIE | I2C_CR1_TXIE | I2C_CR1_TCIE | I2C_CR1_STOPIE | I2C_CR1_NACKIE);

or completely atomic with a write only if we have know the exact state we which to enter, for example

/* Disable IRQ:s but keep I2C peripheral enabled */
i2c->CR1 = I2C_CR1_PE;

Correct me if am wrong but also code size and speed will benefit from this as registers are declared volatile so compiler can't optimize LL_I2C_xxx register modifications. Especially here when using interrupt driven data transfer mode we want every ISR call to be as fast as possible

I think ST:s intention by creating wrapper functions around bits in registers mostly is they thinking it would be easier for a programmer to start using their hardware. But unfortunately at this low level it is actually makes it harder

Anyway, focusing on your change: to ease the review (and the final merge), it would be easier to do the minimum required amount of changes in a first step, specifically by keeping LL usage when it doesn't have functional impact. This fix is for sure welcome but requires careful review and having too much unrelated changes will de-focus the review.

I agree with you here, better focus on getting driver stable than putting lot of effort on removing LL just for the sake of it in this stage Having a quite busy time right now but i can try to fix a PR end of next week or so if its interesting

zephyrproject-rtos / zephyr

STM32 I2C v2 lockup with invalid data and read/write #70077

"Good" Transaction

"Bad" Transaction

Expected behavior

Impact

Additional Context

"Good" Transaction

"Bad" Transaction

As 3 struct i2c_msg segments (header, payload, CRC)

Successful transaction

Failed transaction

As 2 struct i2c_msg segments (header, payload+CRC)

As 1 struct i2c_msg segment (header+payload+CRC)

As 3 `struct i2c_msg` segments (header, payload, CRC)

As 2 `struct i2c_msg` segments (header, payload+CRC)

As 1 `struct i2c_msg` segment (header+payload+CRC)