zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

drivers: mdio: mdio_nxp_enet: Link goes up and down sporadically #76446

Open decsny opened 1 month ago

decsny commented 1 month ago

The NXP ENET MDIO driver has a bug where the interrupt is not cleared properly, causing the PHY link to go up and down sporadically.

yuecelm commented 1 month ago

I tried the bugfix as a cherry pick on the 3.7.0 release. Tested on board mimxrt1050_evk. Unfortunately the link downs are still happening:

[00:05:11.926,000] <inf> eth_nxp_enet_mac: Link is down
[00:05:12.526,000] <inf> eth_nxp_enet_mac: Link is up
[00:05:12.526,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:05:12.526,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

What is strange is that I don't see the log "phy_mc_ksz8081: PHY 0 is down" here, as I do when I actually unplug the cable.

decsny commented 1 month ago

Can you please provide steps to reproduce?

yuecelm commented 1 month ago

I tried to reproduce it with samples/net/cloud/aws_iot_mqtt, but without success so far. My application also includes 2 UARTs for Modbus and M-Bus sensors and an SDHC card (littlefs formatted). My application did not have this link-down behavior with Zephyr release 3.6.0 and the mcux ethernet driver. When I use release 3.7.0 and switch back to the deprecated mcux ethernet driver, I don't observe the link down anymore (>1 day uptime without link down). I will try again next week to reproduce the link-down behavior with a minimal example.

yuecelm commented 2 weeks ago

I could not reproduce this behaviour with a minimal application. For now I have switched back to the mcux ethernet driver so that I can keep working on the 3.7.0 release. I attached the overlay file for this purpose (remove the .txt ending, which was necessary for upload): nxp,kinetis-ethernet.overlay.txt. If there are other fixes to try, I can test them.

szczys commented 2 weeks ago

I am facing the same issue when performing a firmware upgrade using the mimxrt1024_evk board.

The link will go down and back up again but will not regain connectivity:

[00:00:30.771,000] <inf> fw_block_processor: Downloading block index 174 (175/267)
[00:00:30.869,000] <inf> fw_block_processor: Downloading block index 175 (176/267)
[00:00:30.913,000] <inf> fw_block_processor: Downloading block index 176 (177/267)
[00:00:30.962,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_get_link: PHY 0 is down
[00:00:30.962,000] <inf> eth_nxp_enet_mac: Link is down
[00:00:31.463,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_autonegotiate: PHY (0) is entering autonegotiation sequence
[00:00:31.563,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_autonegotiate: PHY (0) autonegotiation completed
[00:00:31.563,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_get_link: PHY 0 is up
[00:00:31.563,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_get_link: PHY (0) Link speed 100 Mb, full duplex

[00:00:31.563,000] <inf> eth_nxp_enet_mac: Link is up
[00:00:31.563,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:00:31.563,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

[00:00:33.671,000] <wrn> golioth_coap_client: 1 resends in last 10 seconds
[00:00:46.307,000] <wrn> golioth_coap_client: 29 resends in last 10 seconds
[00:00:57.913,000] <wrn> golioth_coap_client: 17 resends in last 10 seconds
[00:01:00.912,000] <wrn> golioth_coap_client_zephyr: Receive timeout
[00:01:00.912,000] <inf> golioth_coap_client_zephyr: Ending session
[00:01:00.912,000] <inf> fw_update_sample: Golioth client disconnected
[00:01:00.912,000] <wrn> fw_block_processor: Failed to get block, will retry. Status: 1
[00:01:03.914,000] <err> golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -101
[00:01:03.914,000] <err> golioth_coap_client_zephyr: Failed to connect: -11
[00:01:03.914,000] <wrn> golioth_coap_client_zephyr: Failed to connect: -11

I tried applying the patch that was suggested by @decsny in a similar issue but it did not resolve the problem.

By crafting an overlay file for this board based on @yuecelm's suggestion, I got things working again with the old driver.

@decsny This behavior is 100% reproducible using the fw_update example in our SDK (it must be a branch like this one based on our upcoming Zephyr v3.7.0 release). If you'd like help setting up the example or testing/debugging I'm happy to join in. I'm 'szczys' on discord or email mike at golioth dot io

tonyarkles commented 1 day ago

I ran into this yesterday as well. The fix I did was to verify that the EIMR flag is actually set in the ISR:

static void nxp_enet_mdio_isr_cb(const struct device *dev)
{
    struct nxp_enet_mdio_data *data = dev->data;

    if ((data->base->EIR & ENET_EIR_MII_MASK) && (data->base->EIMR & ENET_EIMR_MII_MASK)) {
        /* Signal that operation finished */
        k_sem_give(&data->mdio_sem);
    }

    /* Disable the interrupt */
    data->base->EIMR &= ~ENET_EIMR_MII_MASK;
}

Happy to try the other patch from this thread as well since I've got an application where the issue is readily reproducible.
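
For anyone who wants to reason about the ISR guard above without hardware, here is a minimal host-side sketch of the same check. Everything in it is a stand-in and not part of the Zephyr driver: `mock_enet_regs`, `mock_mdio_data`, `MOCK_MII_MASK` (including its bit position) and the counter that replaces `k_sem_give()` are all hypothetical. It only models the idea that completion is signalled when the MII event is pending and the interrupt was actually enabled for a transfer; it does not claim to capture the root cause of the link flapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ENET register block and driver data.
 * The bit position is an assumption made for this model only. */
#define MOCK_MII_MASK (1u << 23)

struct mock_enet_regs {
    uint32_t EIR;  /* interrupt event (pending) register */
    uint32_t EIMR; /* interrupt mask (enable) register */
};

struct mock_mdio_data {
    struct mock_enet_regs *base;
    unsigned int sem_count; /* stands in for k_sem_give() */
};

/* Same guard as the ISR above: signal completion only when the MII event
 * is pending AND the interrupt had been enabled for a transfer. */
static void mock_mdio_isr(struct mock_mdio_data *data)
{
    if ((data->base->EIR & MOCK_MII_MASK) && (data->base->EIMR & MOCK_MII_MASK)) {
        data->sem_count++;
    }

    /* Disable the interrupt */
    data->base->EIMR &= ~MOCK_MII_MASK;
}

/* Runs a stale event followed by a real one; returns completions signalled. */
static unsigned int mock_mdio_demo(void)
{
    struct mock_enet_regs regs = { .EIR = MOCK_MII_MASK, .EIMR = 0 };
    struct mock_mdio_data data = { .base = &regs, .sem_count = 0 };

    /* Stale event: flag pending, but no transfer armed the interrupt. */
    mock_mdio_isr(&data);
    assert(data.sem_count == 0);

    /* Real transfer: the interrupt is enabled first, then the event fires. */
    regs.EIMR |= MOCK_MII_MASK;
    mock_mdio_isr(&data);
    assert(data.sem_count == 1);
    assert((regs.EIMR & MOCK_MII_MASK) == 0);

    return data.sem_count;
}
```

Calling `mock_mdio_demo()` from a small `main()` exercises both cases. Without the `EIMR` check, the first call would also bump the counter, which in the real driver would presumably leave the semaphore given before any MDIO transfer had completed.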