zephyriot / zep-jira14

Enabling networking for targets w/o network hw causes hang on boot #1946

Open nashif opened 7 years ago

nashif commented 7 years ago

Reported by Paul Sokolovsky:

This is reproducible with BOARD=96b_carbon, which doesn't have networking support in mainline. The issue title doesn't explain the situation well. What happens is this: if you have a Zephyr application with networking enabled (CONFIG_NETWORKING=y) and you build it and boot it on a target w/o networking hardware, it just locks up with no output (when no debug logging is enabled).

This was originally reported for the MicroPython Zephyr port: https://github.com/micropython/micropython/pull/2975

(Imported from Jira ZEP-2105)

nashif commented 7 years ago

by Paul Sokolovsky:

With the current master, and with net logging enabled, I now see on 96b_carbon:

***** MPU FAULT *****
  Executing thread ID (thread): 0x200018ac
  Faulting instruction address:  0x800a08e
  Data Access Violation
  Address: 0x8
Fatal fault in essential thread! Spinning...

nashif commented 7 years ago

by Paul Sokolovsky:

That's net_if_get_ll_reserve().

nashif commented 7 years ago

by Paul Sokolovsky:

An easy way to reproduce the issue of the IP stack having problems when no L2 driver is available:

In samples/net/echo_server, patch:

--- a/samples/net/echo_server/prj_qemu_x86.conf
+++ b/samples/net/echo_server/prj_qemu_x86.conf
 CONFIG_SYS_LOG_NET_LEVEL=2
-CONFIG_NET_SLIP_TAP=y
+#CONFIG_NET_SLIP_TAP=y
 CONFIG_SYS_LOG_SHOW_COLOR=y

Then with "make run", get:

To exit from QEMU enter: 'CTRL+a, x'
[QEMU] CPU: qemu32
qemu-system-i386: warning: Unknown firmware file in legacy mode: genroms/multiboot.bin

shell> [echo-server] [INF] init_app: Run echo server
***** CPU exception 6
Current thread ID = 0x00111e60
Faulting segment:address = 0x0008:0x01fa71b1
eax: 0x00111a20, ebx: 0x00110d20, ecx: 0x0010e93a, edx: 0x00000000
esi: 0x00000000, edi: 0x00110d20, ebp: 0x00116380, esp: 0x00116374
eflags: 0x212
Fatal fault in essential thread! Spinning...

Note that this is a slightly different issue from 96b_carbon's: in the Carbon case, it faults right on boot, whereas qemu_x86 manages to print "init_app: Run echo server" and faults about a second later.

nashif commented 7 years ago

by Paul Sokolovsky:

The qemu_x86 fault above is apparently a call through a trash pointer, outside the code range.

nashif commented 7 years ago

by Tomasz Bursztyka:

It would be nice to have an error at build time if there is no struct net_if instantiated. But that is something known only at link time, so I'm not sure it's possible to do anything at that point.

nashif commented 7 years ago

by Paul Sokolovsky:

Tomasz Bursztyka: At first glance, and with my hat of "maintainer of a Zephyr application which strives to be truly cross-platform (cross-board)", I'm not sure I agree. The MicroPython Zephyr port has networking on by default (because it's one of the selling points of Zephyr), and I wouldn't want a random user who tries it on a new board to get a build failure. That makes users frustrated; they suspect my app and maybe start to spread FUD about it, while it's a Zephyr problem.

So, what I'm targeting so far is detecting this condition, issuing a fat warning to the user that networking is down, and going on as if nothing happened. We'll need to see how well that goes. E.g. if there's no crash at startup, but a crash on calling any networking function, and fixing that requires a gazillion null pointer checks, then yeah, your solution suddenly becomes very attractive, and I'm sure we'll find a way to implement it once it's proven unavoidable.

nashif commented 7 years ago

by Paul Sokolovsky:

So, with CONFIG_NET_LOG_GLOBAL=y, there are finally more leads (on 96b_carbon):

[net/core] [DBG] net_init: (0x20001888): Priority 90
[net/net_pkt] [DBG] net_pkt_init: (0x20001888): Allocating 4 RX (272 bytes), 2 TX (136 bytes), 16 RX data (2368 bytes) and 16 TX data (2368 bytes) buffers
[net/core] [DBG] l2_init: (0x20001888): Network L2 init done
[net/route] [DBG] net_route_init: (0x20001888): Allocated 8 routing entries (448 bytes)
[net/route] [DBG] net_route_init: (0x20001888): Allocated 8 nexthop entries (224 bytes)
[net/core] [DBG] l3_init: (0x20001888): Network L3 init done
[net/core] [DBG] net_rx_thread: (0x20000a70): Starting RX thread (stack 1500 bytes)
[net/if] [DBG] net_if_init: (0x20000a70): 
[net/if] [WRN] net_if_init: There is no network interface to work with!
[echo-server] [INF] init_app: Run echo server
[net/if] [DBG] net_if_ipv6_addr_add: (0x20001888): [0] interface 0x20000870 address 2001:db8::1 type MANUAL added
***** MPU FAULT *****
  Executing thread ID (thread): 0x20001888
  Faulting instruction address:  0x8006b5e
  Data Access Violation
  Address: 0x0
Fatal fault in essential thread! Spinning...

nashif commented 7 years ago

by Paul Sokolovsky:

Well, the problem turns out to be two-fold. First of all, the application is actually to blame: no in-tree sample apps appear to check the result of net_if_get_default(), so the out-of-tree app doesn't either, d'oh.

But checking it wouldn't help, because net_if_get_default() doesn't live up to its spec, which promises to return NULL when there are no interfaces. Instead, it returns a random pointer. This gets fixed by https://github.com/zephyrproject-rtos/zephyr/pull/198 .

(Started to write this yesterday and fell asleep.)
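
To illustrate the two halves of the problem described above, here is a minimal, self-contained sketch in plain C. It is not the Zephyr code: struct net_if is reduced to a stub, and if_table, if_start, if_end, get_default_if() and app_net_setup() are hypothetical names invented for this example. It assumes the interface table is empty exactly when its start and end bounds coincide, has the getter return NULL in that case, and has the application check the result before using it.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for Zephyr's struct net_if (illustration only). */
struct net_if {
	const char *name;
};

/* Hypothetical interface table. It is treated as empty when
 * if_start == if_end, which models a board with no network driver.
 */
static struct net_if if_table[1] = { { "eth0" } };
static struct net_if *if_start = if_table;
static struct net_if *if_end = if_table; /* empty: no interfaces */

/* Return the first interface, or NULL when the table is empty.
 * Without the emptiness check, the caller would receive a pointer
 * to memory that is not a valid interface.
 */
static struct net_if *get_default_if(void)
{
	if (if_start == if_end) {
		return NULL;
	}

	return if_start;
}

/* Application-side setup: warn and carry on without networking
 * instead of dereferencing a missing interface.
 * Returns 0 when an interface is available, -1 otherwise.
 */
static int app_net_setup(void)
{
	struct net_if *iface = get_default_if();

	if (iface == NULL) {
		printf("WARNING: no network interface, networking is down\n");
		return -1;
	}

	printf("Using interface %s\n", iface->name);
	return 0;
}
```

With the table empty, app_net_setup() returns -1 and the application can continue without networking. Moving if_end one element past if_start makes the same call succeed.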

nashif commented 7 years ago

by Paul Sokolovsky:

Some fixes for this issue went into 1.8.0, but it still needs further looking into.