frdm_k64f: Sometimes, there may be 1s delay for pings and other Ethernet packets

nashif commented 7 years ago

Reported by Paul Sokolovsky:

I never saw this issue with QEMU, so theorize that it's frdm_k64t Ethernet driver issue.

Sometimes, instead of usual sub-millisecond pings:

64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.788 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.476 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.295 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=0.475 ms

you're starting to get:

64 bytes from 192.0.2.1: icmp_seq=88 ttl=64 time=1008 ms
64 bytes from 192.0.2.1: icmp_seq=89 ttl=64 time=1008 ms
64 bytes from 192.0.2.1: icmp_seq=90 ttl=64 time=493 ms
64 bytes from 192.0.2.1: icmp_seq=91 ttl=64 time=1007 ms

So far, I wasn't able to find an exact way to reproduce, but it happens often. Say, I see it in every testing session (5-6 board resets/Ethernet cable reconnects).

When this happens, all other packets seem to be delayed too, not just pings. For example, using netcat UDP with echo_server shows the similar effect of 1s delay.

(Imported from Jira ZEP-1678)

nashif commented 7 years ago

by Paul Sokolovsky:

As a bit of more info, I saw this before https://gerrit.zephyrproject.org/r/10902 (dynamic PHY config on cable connect/disconnect), and saw with it too.

nashif commented 7 years ago

by Paul Sokolovsky:

A case walkthru:

Flashed freshly built echo_server.
ping 192.0.2.1 - ok
ping6 2001:db8::1 - ok
nc -u6 2001:db8::1 4242 - ok
nc -u 192.0.2.1 4242 - problem begins. How it look like is that when type in 1st packet, no response is generated. Typing 2nd line and sending it, I get response to previous packet, etc. Sometimes, after wait, a last expected response does arrive on its own, by after more than 1s subjectively, maybe 2-3s.

nashif commented 7 years ago

by Paul Sokolovsky:

Ok, a response always arrives actually, but perhaps after 4-5s. Typing a next one quickly makes previous response arrive immediately.

All that, I never saw with QEMU.

nashif commented 7 years ago

by Paul Sokolovsky:

Ok, in such state, TCP echo doesn't work at all, so it's definitely another manifestation of GH-1577.

nashif commented 7 years ago

by Paul Sokolovsky:

Now I'm torturing it for the whole 3 minutes, and it still works.

nashif commented 7 years ago

by Paul Sokolovsky:

The same session now underwent 1000 pings (standard pings), and all 6 comm channels ((ICMP + UDP + TCP) * (IPv4 + IPv6)) work well, w/o any delays. (I pay special attention to pings, because I usually start testing with them, and in most cases, the issue manifests soon with them).

This proves that this is not inherent systematic issue, but a race condition. And we can't even be sure it's race in eth_mcux, because Z IP stack surely has its own share of races and weird things yet. And likely, it's eth_mcux race attenuated by Z stack race.

That was "good" news, bad news is that we can't reproduce it in at our will (though natural reoccurrence is quite high definitely).

zephyriot / zep-jira14

frdm_k64f: Sometimes, there may be 1s delay for pings and other Ethernet packets #1532