zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.93k stars 6.65k forks source link

LAN865x not working with RT1170 & RT1064 (likely caused by LPSPI issues) #69658

Open mariopaja opened 8 months ago

mariopaja commented 8 months ago

We have been testing the LAN865x driver and we have had problems with RT1064 and RT1170.

Our main focus was NXP RT1064 but we were not able to implement LAN865x and therefore ended up testing it with several MCU to see what is going wrong.

From the table below it seems that LAN865x works totally fine with other ARM, RISK-V & XTENSA MCUs but not with RT1064 & RT1170

DEV Board Result
NUCLEO-F207ZG OK
NUCLEO-F429ZI OK
NUCLEO-L552ZE-Q OK
NUCLEO-U575ZI-Q OK
NUCLEO-H745ZI-Q OK
Arduino Giga R1 OK
B-U585I-IOT02A OK
XIAO ESP32C3 OK
XIAO ESP32S3 OK
RT1064 FAILED
RT1170 FAILED

XIAO ESP32C3 esp32c3

XIAO ESP32S3 xiao_esp32s3

NUCLEO H563 h563

RT1170 LAN865x rt1170_lan865x

RT1064 rt1064

Additionally a test with W5500 module RT1170 W5500 rt1170_w5500

aescolar commented 8 months ago

CC @decsny @lmajewski This driver seems to have been written by @lmajewski but he does not seem to be a Zephyr collaborator so I cannot assign it to him

decsny commented 8 months ago

If you can narrow down if it is an LPSPI issue or LAN865 issue that would be helpful because I don't have a LAN865, and the author of the driver most likely does not have a platform with LPSPI. Since you said it works on other platforms, maybe you can compare the SPI signals with a signal analyzer between those platforms and the RT1064 to see where is the difference as your next step in debugging.

lmajewski commented 8 months ago

@mariopaja - I only had the Nucleo Devel board for T1S development. I've tried to use the Zephyr kernel's well defined APIs, so I assume that semaphores shall work in the same way cross platform.

From the above comment - XIAO ESP32C3 is a RISC-V SoC, so it looks like at least on other SoC it works.

Could you share which HW revision of LAN8651 do your have? (B0 or B1) With B0 for example I had issues with SW reset - hence the GPIO reset is used (and recommended). As advised by @decsny - comparing the SPI signals would be a good starting point.

Just a side note - I shall have come back to this driver in short time, as the full duplex transmission needs to be finished (now it is a very simple, working half-duplex driver).

mariopaja commented 8 months ago

@lmajewski

with RISK-V yes. I would definitely have to do a second test round with all the MCUs. We are using B0 variant. And btw thank you a lot for the support of this driver :)

@decsny

I am also suspicious on the driver itself because I have faced issues with the chip select, and there are other developers that are mentioning that they are facing difficulties like https://github.com/zephyrproject-rtos/zephyr/discussions/69692#discussion-6300901.

These made me a bit suspicious and confused because these two issues are not directly related to each other. I will be able to provide the SPI signals for STM, NXP, & ESP32-S3/C3 by the end of the next working week.

decsny commented 8 months ago

I am also suspicious on the driver itself because I have faced issues with the chip select, and there are other developers that are mentioning that they are facing difficulties like #69692 (comment).

There is this issue with the LPSPI CS tracked here: https://github.com/zephyrproject-rtos/zephyr/issues/16544 , maybe consider if this is a cause

mariopaja commented 8 months ago

@lmajewski lan865x driver seems to be fine. I was able to make it work with also with xiao_esp32s3 (XTENSA).

@decsny I checked the rt1170 and the signal seems kind of inverted. Maybe it is not a bug, maybe I have configured the SPI wrong :/ At the moment I am using the following configuration:

&lpspi1 {
  status = "okay";
  cs-gpios = <&gpio3 28 GPIO_ACTIVE_LOW>; /* D10 */
  eth0: lan865x@0{
    status = "okay";
    compatible = "microchip,lan865x";
    reg = <0>;
    int-gpios = <&gpio3 6 GPIO_ACTIVE_LOW>; /* D9 */
    rst-gpios = <&gpio3 0 GPIO_ACTIVE_LOW>; /* D8 */
    local-mac-address = [00 19 05 00 00 04];
    spi-max-frequency = <20000000>;
    plca-enable;
    plca-node-id = <3>;
    plca-node-count = <8>;
    plca-burst-count = <0x0>;
    plca-burst-timer = <0x80>;
    plca-to-timer = <0x20>;

  };
};
mariopaja commented 7 months ago

@lmajewski

I was able to point the error to: https://github.com/zephyrproject-rtos/zephyr/blob/ca527d9f12bceb23d8ace958233edcdd63de3db7/drivers/ethernet/oa_tc6.c#L309

before getting the following error: spi_mcux_lpspi: Transfer could not start on spi@40394000: 4

and is is either transmitting the first 3 bytes of the header and then the SPI stops transmitting or the SPI doesnt start the transmittion at all like the image below:

rt1064_2024_04_10_0

@decsny It also seems like the SPI is transmitting 16 bytes at a time without any delay between each byte transmission (image below). Maybe this is creating a bottleneck.

rt1064_2024_04_10_1

Is there a possibility to lower the transmission speed between the bytes and make it look more like the behaviour of STM32 H563 (image below)?

h563_2024_03_20

mariopaja commented 7 months ago

@lmajewski for further testing I added the following code to https://github.com/zephyrproject-rtos/zephyr/blob/ca527d9f12bceb23d8ace958233edcdd63de3db7/drivers/ethernet/oa_tc6.c#L272 and only then I was able to transmit the spi chunk and deassert the interrupt.

...
#ifdef CONFIG_BOARD_MIMXRT1064_EVK
        int ret;

        uint8_t tx_cmd[OA_TC6_HDR_SIZE+64]={0};
        hdr = sys_cpu_to_be32(hdr);
        //LOG_ERR("hdr: %x", hdr);

        //tx_cmd[0]=hdr[0];
        tx_cmd[3]=(hdr >> 24) & 0xFF;
        tx_cmd[2]=(hdr >> 16) & 0xFF;
        tx_cmd[1]=(hdr >> 8) & 0xFF;
        tx_cmd[0]=(hdr) & 0xFF;

        if(buf_tx!=NULL){
            //LOG_ERR("buf_tx!=NULL");
            //TODO: Do smth
            for(int i=4;i<68;i++){
                //LOG_ERR("buf_tx: 0x%x",*buf_tx);
                tx_cmd[i]=*buf_tx;
                buf_tx++;
            }

        }

        struct spi_buf tx_buf={.buf = &tx_cmd, .len = ARRAY_SIZE(tx_cmd),};
        struct spi_buf_set tx = {.buffers = &tx_buf, .count = 1,};

        uint8_t rx_cmd[64+OA_TC6_FTR_SIZE]={0};

        if(buf_rx!=NULL){
            LOG_ERR("buf_rx!=NULL");
            //TODO: Do smth
        }

        const struct spi_buf rx_buf = {.buf = rx_cmd, .len = ARRAY_SIZE(rx_cmd),};
        const struct spi_buf_set rx = {.buffers = &rx_buf, .count = 1,};

        ret = spi_transceive_dt(tc6->spi, &tx, &rx);
        if (ret < 0) {
            return ret;
        }
        //LOG_ERR("rx_cmd[64]: %x", rx_cmd[64]);

        uint32_t data;
        data=rx_cmd[67];
        data|=rx_cmd[66]<<8;
        data|=rx_cmd[65]<<16;
        data|=rx_cmd[64]<<24;

        *ftr=data;

#else

...
mariopaja commented 7 months ago

@lmajewski I finished the patch today for the rt1064 and I could get the expected behaviour and transmit & recevie packets.

@decsny Is there any particular reason why it would work with one and not with the other?

int oa_tc6_chunk_spi_transfer(struct oa_tc6 *tc6, uint8_t *buf_rx, uint8_t *buf_tx, uint32_t hdr,
                  uint32_t *ftr)
{
    // Correct behaviour
#ifdef CONFIG_BOARD_MIMXRT1064_EVK
    int ret;

    uint8_t tx_cmd[OA_TC6_HDR_SIZE + 64] = {0};
    hdr = sys_cpu_to_be32(hdr);
    // LOG_ERR("hdr: %x", hdr);

    // tx_cmd[0]=hdr[0];
    tx_cmd[3] = (hdr >> 24) & 0xFF;
    tx_cmd[2] = (hdr >> 16) & 0xFF;
    tx_cmd[1] = (hdr >> 8) & 0xFF;
    tx_cmd[0] = (hdr) & 0xFF;

    if (buf_tx != NULL) {
        for (int i = 4; i < 68; i++) {
            tx_cmd[i] = *buf_tx;
            buf_tx++;
        }
    }

    struct spi_buf tx_buf = {
        .buf = &tx_cmd,
        .len = ARRAY_SIZE(tx_cmd),
    };
    struct spi_buf_set tx = {
        .buffers = &tx_buf,
        .count = 1,
    };

    uint8_t rx_cmd[64 + OA_TC6_FTR_SIZE] = {0};

    const struct spi_buf rx_buf = {
        .buf = rx_cmd,
        .len = ARRAY_SIZE(rx_cmd),
    };
    const struct spi_buf_set rx = {
        .buffers = &rx_buf,
        .count = 1,
    };

    ret = spi_transceive_dt(tc6->spi, &tx, &rx);
    if (ret < 0) {
        return ret;
    }

    for (int i = 0; i < 64; i++) {
        *buf_rx = rx_cmd[i];
        buf_rx++;
    }

    uint32_t data;
    data = rx_cmd[67];
    data |= rx_cmd[66] << 8;
    data |= rx_cmd[65] << 16;
    data |= rx_cmd[64] << 24;

    *ftr = data;

#else
    struct spi_buf tx_buf[2];
    struct spi_buf rx_buf[2];
    struct spi_buf_set tx;
    struct spi_buf_set rx;
    int ret;

    hdr = sys_cpu_to_be32(hdr);
    tx_buf[0].buf = &hdr;
    tx_buf[0].len = sizeof(hdr);

    LOG_ERR("sizeof(hdr): %d", sizeof(hdr));

    tx_buf[1].buf = buf_tx;
    tx_buf[1].len = tc6->cps;
    LOG_ERR("tc6->cps: %d", tc6->cps);

    tx.buffers = tx_buf;
    tx.count = ARRAY_SIZE(tx_buf);

    rx_buf[0].buf = buf_rx;
    rx_buf[0].len = tc6->cps;

    rx_buf[1].buf = ftr;
    rx_buf[1].len = sizeof(*ftr);

    rx.buffers = rx_buf;
    rx.count = ARRAY_SIZE(rx_buf);

    ret = spi_transceive_dt(tc6->spi, &tx, &rx);
    if (ret < 0) {
        return ret;
    }
    *ftr = sys_be32_to_cpu(*ftr);

#endif

    return oa_tc6_update_status(tc6, *ftr);
}
lmajewski commented 7 months ago

@mariopaja - Am I correct, that you need to delay the transmission start between sending first byte and the CS being low? As well as your's CPU (to talk with LAN965x) requires some time between sending each byte?

Is there a possibility to lower the transmission speed between the bytes

The SPI transmission shall not impact the TC6 driver - it shall be tuned in the CPU setup code. It looks like the issue is either with the way LAN8651 is communicating via SPI or with the NXP's RT1170 HAL's SPI driver.

Could you check if LAN8651 documentation has any requirements for it? Maybe you's SPI setup (like polarity or CLK frequency) is different than that required?

Could you also check if lowering the SPI frequency helps with the communication stability on your setup?

mariopaja commented 7 months ago

@lmajewski I think it has to do with the MCUX SPI HAL, because the issue arises as TC6 tries to transmit 4 bytes of the header.

It looks like the issue is either with the way LAN8651 is communicating via SPI or with the NXP's RT1170 HAL's SPI driver.

For this reason I circumvented this issue with the patch here: https://github.com/zephyrproject-rtos/zephyr/issues/69658#issuecomment-2048975684. Instead of providing two buffers at different lengths I give one big buffer to SPI. As a solution it may be enough for our case but it is at no means a good solution.

lmajewski commented 7 months ago

@mariopaja - I do have an impression that the patch you mentioned would we OK only for some short time.... In the long term IMHO it would be better to have this issue investigated and correctly fixed in the HAL. Maybe NXP's maintainers (especially for HAL) shall be informed about the issue? Maybe they could look on this problem?

github-actions[bot] commented 5 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

decsny commented 5 months ago

@mariopaja I have been looking into the lpspi driver lately because of various other issues, and I have a feeling they are related to your issue with this. For example, the chip select is not remaining asserted throughout the whole transfer, that could be the issue, or a host of other ones that I am finding. Basically, the lpspi driver is not even implementing the zephyr api correctly in my opinion at all. I am doing a major rework of this driver that probably will not show up until after LTS because it is too major of a change for this current release cycle.

github-actions[bot] commented 3 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.