zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.84k stars 6.6k forks source link

net: nxp_enet: dhcpv4_client sample never finds an IP and doesn't require a cable to be connected for socket functions to pass: mimxrt1050_evk. #76437

Open jtlas19 opened 3 months ago

jtlas19 commented 3 months ago

Describe the bug

Since the new ENET NXP driver has recently been officially released in Zephyr v3.7, I have been investigating updating from our version which is currently at commit 64f41309d104b161b1960bcc20e2ff6bd58c6355 . On our version, Ethernet works okay in regards to general use with some minor bugs that should be included in another bug report, they are irrelevant to this issue. Moving forward to today, I updated our Zephyr version to the v3.7 commit 36940db938a8f4a1e919496793ed439850a221c2 , this prompted Ethernet to not work at all. Every POSIX network related function seems to pass and return successfully (even with errno's indicating success) when forcefully binding the socket to the Ethernet iface, even when there is no physical cable connected.

Trying the dhcpv4_client example on our RT1050 Development board prompted the PHY driver to detect the Ethernet cable successfully and decide on a MAC address, but even after leaving the board on for up to 30 minutes it never finds an IP address and it also thinks there is a cable connected when there isn't at all. I have confirmed our Ethernet socket is live by plugging it into my Laptop which is able to use it completely fine.

To Reproduce

Steps to reproduce the behavior:

  1. Build dhcpv4_client sample for the mimxrt1050_evk
  2. Flash it
  3. See that it never gets an IP address and there is no need for an Ethernet cable for the POSIX network functions to pass

Expected behavior

I expect it to work like an Ethernet driver should, like It did on v3.6.

Impact

This bug completely breaks our network stack, as we have three different network interfaces on our board with ethernet being the highest priority in the stack. Since the socket functions pass without issue when the Ethernet is very clearly down, this breaks our whole network stack

Logs and console output

[00:00:01.001,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_autonegotiate: PHY (0) is entering autonegotiation sequence
[00:00:01.102,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_autonegotiate: PHY (0) autonegotiation completed
[00:00:01.103,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_get_link: PHY 0 is up
[00:00:01.103,000] <dbg> phy_mc_ksz8081: phy_mc_ksz8081_get_link: PHY (0) Link speed 100 Mb, full duplex

[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

[00:00:01.104,000] <dbg> eth_nxp_enet_mac: eth_nxp_enet_init: ethernet MAC 02:04:9f:94:78:98
*** Booting Zephyr OS build v3.7.0 ***
[00:00:01.104,000] <inf> net_dhcpv4_client_sample: Run dhcpv4 client
[00:00:01.104,000] <inf> net_dhcpv4_client_sample: Start on ethernet: index=1
uart:~$ net iface

Interface eth0 (0x800004e8) (Ethernet) [1]
===================================
Link addr : 02:04:9F:94:78:98
MTU       : 1500
Flags     : AUTO_START,IPv4
Device    : ethernet (0x6001dd20)
Ethernet capabilities supported:
        TX checksum offload
        RX checksum offload
        10 Mbits
        100 Mbits
        MAC address filtering
IPv4 unicast addresses (max 1):
        <none>
IPv4 multicast addresses (max 2):
        224.0.0.1
IPv4 gateway : 0.0.0.0
DHCPv4 lease time : 0
DHCPv4 renew time : 0
DHCPv4 server     : 0.0.0.0
DHCPv4 requested  : 0.0.0.0
DHCPv4 state      : selecting
DHCPv4 attempts   : 8
uart:~$

Environment (please complete the following information):

Additional context

github-actions[bot] commented 3 months ago

Hi @jtlas19! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

jukkar commented 3 months ago

I just tried this with my imxrt1050-evkb board, DHCPv4 worked just fine

[00:00:04.012,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:00:04.012,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

*** Booting Zephyr OS build v3.7.0 ***
[00:00:04.013,000] <inf> net_dhcpv4_client_sample: Run dhcpv4 client
[00:00:04.013,000] <inf> net_dhcpv4_client_sample: Start on ethernet: index=1
[00:00:10.604,000] <inf> net_dhcpv4_client_sample: DHCP Option 42: 193.66.253.82
[00:00:10.605,000] <inf> net_dhcpv4: Received: 192.168.88.34
[00:00:10.605,000] <inf> net_dhcpv4_client_sample:    Address[1]: 192.168.88.34
[00:00:10.605,000] <inf> net_dhcpv4_client_sample:     Subnet[1]: 255.255.255.0
[00:00:10.605,000] <inf> net_dhcpv4_client_sample:     Router[1]: 192.168.88.1
[00:00:10.605,000] <inf> net_dhcpv4_client_sample: Lease time[1]: 600 seconds
uart:~$ 
uart:~$ net iface
Hostname: zperf

Interface eth0 (0x80000540) (Ethernet) [1]
===================================
Link addr : 02:04:9F:6A:2B:CC
MTU       : 1500
Flags     : AUTO_START,IPv4
Device    : ethernet (0x60022754)
Ethernet capabilities supported:
        TX checksum offload
        RX checksum offload
        10 Mbits
        100 Mbits
        MAC address filtering
IPv4 unicast addresses (max 1):
        192.168.88.34/255.255.255.0 DHCP preferred
IPv4 multicast addresses (max 2):
        224.0.0.1
        224.0.0.251
IPv4 gateway : 192.168.88.1
DHCPv4 lease time : 600
DHCPv4 renew time : 300
DHCPv4 server     : 192.168.88.1
DHCPv4 requested  : 192.168.88.34
DHCPv4 state      : bound
DHCPv4 attempts   : 1
uart:~$ 
jtlas19 commented 3 months ago

Interesting, I have tried again and can get it to work with commit 3bcaa6f8d63136ffc2ae37b6e77ee4e1cda6964f (latest main as of this morning) but still not v3.7.

Additionally, I am having issues with using more than one interface at once with the latest main. I have another interface that is functioning correctly, but ethernet takes priority and is also working by itself. If I have both of them together I get -11 from zsock_getaddrinfo but no errors returned from zsock_socket, zsock_setsockopt or zsock_connect. Any idea? It is breaking our network stack :)

jukkar commented 3 months ago

I also tried with latest upstream, and it worked there too for me.

jtlas19 commented 3 months ago

I also tried with latest upstream, and it worked there too for me.

Unsure if it makes a difference but you are using a revision B chip where as ours is revivison A. That's the only difference i can see. Any ideas on my previous comment?

Cheers

jukkar commented 3 months ago

I have REV A1 of the board, dated 2018

jukkar commented 3 months ago

About the getaddrinfo error, difficult to know what goes wrong there by just seeing the errno.

decsny commented 3 months ago

Interesting, I have tried again and can get it to work with commit 3bcaa6f (latest main as of this morning) but still not v3.7.

Additionally, I am having issues with using more than one interface at once with the latest main. I have another interface that is functioning correctly, but ethernet takes priority and is also working by itself. If I have both of them together I get -11 from zsock_getaddrinfo but no errors returned from zsock_socket, zsock_setsockopt or zsock_connect. Any idea? It is breaking our network stack :)

it's known that nxp ethernet doesn't work on 3.7 release, because people were wanting to rush a bunch of changes last minute that allowed bugs with less time for testing, try with latest main. hopefully 3.7.1 can be released soon with fixed nxp ethernet.

decsny commented 3 months ago

if you need to work on 3.7 try cherry picking this commit 2af4aa8045af37e1a718ed5f166313ac338d70cf to see if it fixes it

jukkar commented 3 months ago

it's known that nxp ethernet doesn't work on 3.7 release

I tried the DHCPv4 sample with v3.7 and I could get the IP address, so I was probably just lucky then. I did not try to actively use the Ethernet so it might have failed at that point.

jtlas19 commented 3 months ago

if you need to work on 3.7 try cherry picking this commit 2af4aa8 to see if it fixes it

I have tried latest main with that change in and the behavior is different. I get an IP, i can use ethernet correctly but even without the PHY being up and a cable being plugged in, the socket functions pass. Additionally, i still an unable to use any other network interface with Ethernet being up/ready for use. As i mentioned here:

Interesting, I have tried again and can get it to work with commit 3bcaa6f (latest main as of this morning) but still not v3.7.

Additionally, I am having issues with using more than one interface at once with the latest main. I have another interface that is functioning correctly, but ethernet takes priority and is also working by itself. If I have both of them together I get -11 from zsock_getaddrinfo but no errors returned from zsock_socket, zsock_setsockopt or zsock_connect. Any idea? It is breaking our network stack :)

jukkar commented 3 months ago

no errors returned from zsock_socket, zsock_setsockopt or zsock_connect

Do you have UDP or TCP socket? If UDP, then those calls do not really send anything to peer, for TCP the connect should fail if it cannot connect.

jtlas19 commented 3 months ago

no errors returned from zsock_socket, zsock_setsockopt or zsock_connect

Do you have UDP or TCP socket? If UDP, then those calls do not really send anything to peer, for TCP the connect should fail if it cannot connect.

They are TCP sockets. For example, we are opening a socket, binding it to Ethernet, connecting to it then closing the socket and this passes fine.

jtlas19 commented 3 months ago

no errors returned from zsock_socket, zsock_setsockopt or zsock_connect

Do you have UDP or TCP socket? If UDP, then those calls do not really send anything to peer, for TCP the connect should fail if it cannot connect.

They are TCP sockets. For example, we are opening a socket, binding it to Ethernet, connecting to it then closing the socket and this passes fine.

I am unsure if this issue is somehow related to the fact that with any other interface active while Ethernet is up, networking related functions like getaddrinfo etc do not work. But when you take one of the interfaces away, it works again. This behavior did not exist on v3.6 release, we had all of our interfaces working correctly at the same time, including Ethernet :)

jukkar commented 3 months ago

I am trying to figure out how to replicate your getaddrinfo problem, is the same issue seen for example with dns_resolve sample, @jtlas19 could you try it?

jtlas19 commented 3 months ago

I am trying to figure out how to replicate your getaddrinfo problem, is the same issue seen for example with dns_resolve sample, @jtlas19 could you try it?

I will give it a try this afternoon and report back 😄

jtlas19 commented 3 months ago

@jukkar I have just given it a go and I get the same behaviour as I am experiencing! The sample seems to pass with or without an Ethernet cable plugged in with the same result 😆

Result from with the cable plugged in:


[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

*** Booting Zephyr OS build v3.7.0-309-g6211de87c066 ***
[00:00:01.105,000] <inf> net_config: Initializing network
[00:00:01.105,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:01.205,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:01.205,000] <inf> net_dns_resolve_client_sample: Starting DNS resolve sample
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_ipv4_lookup: DNS id 27038
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv4_lookup: Doing mDNS IPv4 query
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv4_lookup: mDNS v4 query sent
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_ipv6_lookup: DNS id 50170
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv6_lookup: Doing mDNS IPv6 query
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv6_lookup: mDNS v6 query sent
[00:00:03.205,000] <inf> net_dns_resolve_client_sample: DNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: DNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: mDNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: mDNS query was canceled
uart:~$
uart:~$
uart:~$ net iface

Interface eth0 (0x80000b10) (Ethernet) [1]
===================================
Link addr : 02:04:9F:22:D1:6D
MTU       : 1500
Flags     : AUTO_START,IPv4,IPv6
Device    : ethernet (0x60024080)
Ethernet capabilities supported:
        10 Mbits
        100 Mbits
        MAC address filtering
Ethernet PHY device: phy@0 (0x6002406c)
Ethernet link speed: 100 Mbits full-duplex
IPv6 unicast addresses (max 3):
        fe80::4:9fff:fe22:d16d autoconf preferred infinite
        2001:db8::1 manual preferred infinite
IPv6 multicast addresses (max 5):
        ff02::1
        ff02::1:ff22:d16d
        ff02::1:ff00:1
IPv6 prefixes (max 2):
        <none>
IPv6 hop limit           : 64
IPv6 base reachable time : 30000
IPv6 reachable time      : 40200
IPv6 retransmit timer    : 0
IPv4 unicast addresses (max 1):
        192.0.2.1/255.255.255.0 manual preferred infinite
IPv4 multicast addresses (max 2):
        224.0.0.1
IPv4 gateway : 0.0.0.0

Result from test without the cable plugged in:


[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY 0 is up
[00:00:01.103,000] <inf> phy_mc_ksz8081: PHY (0) Link speed 100 Mb, full duplex

*** Booting Zephyr OS build v3.7.0-309-g6211de87c066 ***
[00:00:01.105,000] <inf> net_config: Initializing network
[00:00:01.105,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:01.205,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:01.205,000] <inf> net_dns_resolve_client_sample: Starting DNS resolve sample
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_ipv4_lookup: DNS id 61834
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv4_lookup: Doing mDNS IPv4 query
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv4_lookup: mDNS v4 query sent
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_ipv6_lookup: DNS id 50992
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv6_lookup: Doing mDNS IPv6 query
[00:00:01.206,000] <dbg> net_dns_resolve_client_sample: do_mdns_ipv6_lookup: mDNS v6 query sent
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: DNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: DNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: mDNS query was canceled
[00:00:03.206,000] <inf> net_dns_resolve_client_sample: mDNS query was canceled
uart:~$ net iface

Interface eth0 (0x80000b10) (Ethernet) [1]
===================================
Link addr : 02:04:9F:09:D0:86
MTU       : 1500
Flags     : AUTO_START,IPv4,IPv6
Device    : ethernet (0x60024080)
Ethernet capabilities supported:
        10 Mbits
        100 Mbits
        MAC address filtering
Ethernet PHY device: phy@0 (0x6002406c)
Ethernet link speed: 100 Mbits full-duplex
IPv6 unicast addresses (max 3):
        fe80::4:9fff:fe09:d086 autoconf preferred infinite
        2001:db8::1 manual preferred infinite
IPv6 multicast addresses (max 5):
        ff02::1
        ff02::1:ff09:d086
        ff02::1:ff00:1
IPv6 prefixes (max 2):
        <none>
IPv6 hop limit           : 64
IPv6 base reachable time : 30000
IPv6 reachable time      : 16960
IPv6 retransmit timer    : 0
IPv4 unicast addresses (max 1):
        192.0.2.1/255.255.255.0 manual preferred infinite
IPv4 multicast addresses (max 2):
        224.0.0.1
IPv4 gateway : 0.0.0.0
uart:~$

I am testing with latest main (6211de87c0660a5dfd21b5decc93a67236ec0984)

jukkar commented 3 months ago

Right after boot the cable detection does not work but the interface is up even if cable is not connected. After connecting and then disconnecting the cable, the network interface is taken down as expected. @decsny could you take a look at this, this looks like a driver issue.

jukkar commented 3 months ago

@jtlas19 is your DNS issue perhaps just related to the fact that the cable detection is not working properly. If you feel it is something else, then I would need a way to replicate the issue, in which case a simple app or code snippet showing the failure would be appreciated.

jtlas19 commented 3 months ago

Hello, I apologise for the late response. You mention about the cable detection not working correctly? Is this a known issue or something similar? I do not think that is my issue at the moment no since I have had the cable detection working correctly. My issue is that all POSIX socket functions return successfully even when no Ethernet cable is connected. But the cable detection works fine as I can plug a cable in and it bring the link up.

Any sample that uses a basic open socket, set socket opts, connect to socket then closure of socket. Please let me know your findings. Thanks!!

decsny commented 3 months ago

Hello, I apologise for the late response. You mention about the cable detection not working correctly? Is this a known issue or something similar? I do not think that is my issue at the moment no since I have had the cable detection working correctly. My issue is that all POSIX socket functions return successfully even when no Ethernet cable is connected. But the cable detection works fine as I can plug a cable in and it bring the link up.

Any sample that uses a basic open socket, set socket opts, connect to socket then closure of socket. Please let me know your findings. Thanks!!

what is the most basic sample with which you can reproduce this behavior

jtlas19 commented 3 months ago

Hello, I apologise for the late response. You mention about the cable detection not working correctly? Is this a known issue or something similar? I do not think that is my issue at the moment no since I have had the cable detection working correctly. My issue is that all POSIX socket functions return successfully even when no Ethernet cable is connected. But the cable detection works fine as I can plug a cable in and it bring the link up. Any sample that uses a basic open socket, set socket opts, connect to socket then closure of socket. Please let me know your findings. Thanks!!

what is the most basic sample with which you can reproduce this behavior

I will create a sample of what I mentioned about opening a socket etc and test, If I can reproduce the issue with it then I will send it over. 😄

github-actions[bot] commented 1 month ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.