zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.2k stars 6.25k forks source link

can_mcp251xfd sometimes fails to initialize + silent errors on initialization sequence #73586

Open melvinvdb opened 1 month ago

melvinvdb commented 1 month ago

1st issue:

We're using the NRF53 with MCP25186. After mcu reboot the can_mcp251xfd sometimes fails to initialize. This will not show up in the uart logs (2nd issue), instead you will notice that no interrupts are generated when CAN messages are received. The interrupt pin will never go low. The issue lies in the reset sequence: https://github.com/zephyrproject-rtos/zephyr/blob/4b6d3f1e16ea8ccccbf85f1eaf7efd0caea11766/drivers/can/can_mcp251xfd.c#L1573 This reset method has no delay and will immediately continue with reading the MCP251XFD_REG_CON register: https://github.com/zephyrproject-rtos/zephyr/blob/4b6d3f1e16ea8ccccbf85f1eaf7efd0caea11766/drivers/can/can_mcp251xfd.c#L1603-L1606 This will set retto -EINVAL but in the goto done this will be overwritten (part of 2nd issue): https://github.com/zephyrproject-rtos/zephyr/blob/4b6d3f1e16ea8ccccbf85f1eaf7efd0caea11766/drivers/can/can_mcp251xfd.c#L1674-L1678

The solution is to add a k_msleep(5); to the bottom of the mcp251xfd_reset function. This makes the MCP2518 to always successfully initialize on MCU reboot.

2nd issue:

In mcp251xfd_init, if any of the following register configuration fails this will not be noticed by the system/uart log and will silently continue:

The whole function returns success as long as can_set_timing succeeds! To me this initialization function needs a redesign. This is not up to standard.

To Reproduce Logs and console output Reproduction might be difficult. We seem to have this issue on our custom board with an nrf5340 and mcp251863. It seems to always happen after doing:

  1. power up
  2. receive a few can message
  3. reboot from inside mcu sys_reboot(SYS_REBOOT_COLD)
  4. initialization might fail

I've added the following code to can_mcp251xfd.c to debug the MCP registers (place at the bottom):

#include <zephyr/shell/shell.h>
static void mcp_read_shell(const struct shell *sh, size_t argc, char **argv)
{

    const struct device *can_dev = DEVICE_DT_GET_ANY(microchip_mcp251xfd);
    uint32_t* reg;

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_CON, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_CON 0x%08x", *reg);
    uint8_t opmod = FIELD_GET(MCP251XFD_REG_CON_OPMOD_MASK, *reg);
    LOG_INF("MCP251XFD_REG_CON_OPMOD: %u", opmod);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_OSC, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_OSC 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_IOCON, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_IOCON 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_INT, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_INT 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_TDC, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_TDC 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_TEFCON, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_TEFCON 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_TXQCON, MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_TXQCON 0x%08x", *reg);

    reg = mcp251xfd_read_crc(can_dev, MCP251XFD_REG_FIFOCON(MCP251XFD_RX_FIFO_IDX), MCP251XFD_REG_SIZE);
    if (!reg) {
        LOG_ERR("Reg read failure");
    }
    *reg = sys_le32_to_cpu(*reg);
    LOG_INF("MCP251XFD_REG_FIFOCON(MCP251XFD_RX_FIFO_IDX) 0x%08x", *reg);
}

SHELL_CMD_ARG_REGISTER(mcp_read, NULL, "mcp read", mcp_read_shell, 1, 0);

This should result in the following registers (WORKING situation):

[00:00:07.134,674] <inf> can_mcp251xfd: MCP251XFD_REG_CON 0x00180720
[00:00:06.643,951] <inf> can_mcp251xfd: MCP251XFD_REG_CON_OPMOD: 0
[00:00:07.134,826] <inf> can_mcp251xfd: MCP251XFD_REG_OSC 0x00000460
[00:00:07.134,979] <inf> can_mcp251xfd: MCP251XFD_REG_IOCON 0x03030003
[00:00:07.135,101] <inf> can_mcp251xfd: MCP251XFD_REG_INT 0x201a0000
[00:00:07.135,253] <inf> can_mcp251xfd: MCP251XFD_REG_TDC 0x00000000
[00:00:07.135,406] <inf> can_mcp251xfd: MCP251XFD_REG_TEFCON 0x07000001
[00:00:07.135,559] <inf> can_mcp251xfd: MCP251XFD_REG_TXQCON 0x07600080
[00:00:07.135,681] <inf> can_mcp251xfd: MCP251XFD_REG_FIFOCON(MCP251XFD_RX_FIFO_IDX) 0x0f000001

On initialization failure they look like:

[00:00:06.699,981] <inf> can_mcp251xfd: MCP251XFD_REG_CON 0x00180760
[00:00:08.607,330] <inf> can_mcp251xfd: MCP251XFD_REG_CON_OPMOD: 0
[00:00:06.700,134] <inf> can_mcp251xfd: MCP251XFD_REG_OSC 0x00000460
[00:00:06.700,256] <inf> can_mcp251xfd: MCP251XFD_REG_IOCON 0x03030003
[00:00:06.700,408] <inf> can_mcp251xfd: MCP251XFD_REG_INT 0x00000008
[00:00:06.700,561] <inf> can_mcp251xfd: MCP251XFD_REG_TDC 0x00021000
[00:00:06.700,683] <inf> can_mcp251xfd: MCP251XFD_REG_TEFCON 0x00000000
[00:00:06.700,836] <inf> can_mcp251xfd: MCP251XFD_REG_TXQCON 0x00600080
[00:00:06.700,988] <inf> can_mcp251xfd: MCP251XFD_REG_FIFOCON(MCP251XFD_RX_FIFO_IDX) 0x00600000

Again, this is caused by a failure to read on: https://github.com/zephyrproject-rtos/zephyr/blob/4b6d3f1e16ea8ccccbf85f1eaf7efd0caea11766/drivers/can/can_mcp251xfd.c#L1603 Which causes the whole initialization sequence to be silently skipped.

Expected behavior First of to not fail the initialization of mcp251863. Second to not silently continue if initialization fails.

Impact This is a showstopper for us. I'm currently trying to find out how to detect and correct this without modifying the zephyr code base.

Environment (please complete the following information):

github-actions[bot] commented 1 month ago

Hi @melvinvdb! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

henrikbrixandersen commented 1 month ago

Thank you for the detailed bug report. Are you able to submit a pull request for the initialization fix already tested?

melvinvdb commented 1 month ago

Created a tested pull request!

talih0 commented 1 month ago

@melvinvdb What is the clock source for the mcp25xxfd? Is it stable after reset?

henrikbrixandersen commented 1 month ago

@melvinvdb Ping?

melvinvdb commented 1 month ago

Sorry for the late reply our project has put a lower priority on this. I've checked and the clock source stays stable even when resetting. The mcp has it's own clock at 40 mhz. This clock automatically starts when power is applied (even before any spi command is given). The 40 mhz clock isn't interrupted when resetting the mcp. Resetting doesn't seem to have any effect on the clock.