IP stack can't recover from a packet overload

therealprof commented 6 years ago

I've connected my Nucleo-F429ZI to a reasonably busy network (i.e. a constant flow of broadcast messages which also end up at the network interface of the MCU) and it very soon ends up receiving more packets than it can handle but the main problem is that even if the load lightens the network application doesn't receive any packets anymore and renders the system useless.

After the first occurrence of this error message the application is dead:

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer

I tried increasing buffers and stack but to no avail. Not sure which information I can provide to aid in debugging but it is 100% reproducible (and hence very annoying ;) ):

shell> net stats

Global statistics
=================
IPv4 recv      579      sent    12      drop    569     forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    450
UDP chkerr     0
Bytes received 152255
Bytes sent     1237
Processing err 489
shell> net stats all

Interface 0x2000a140 (Ethernet) [0]
===================================
IPv4 recv      579      sent    12      drop    569     forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    450
UDP chkerr     0
Bytes received 152315
Bytes sent     1237
Processing err 490
shell> [net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer
[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer
[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer

shell> net stats

Global statistics
=================
IPv4 recv      2954     sent    12      drop    2944    forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    2190
UDP chkerr     0
Bytes received 834305
Bytes sent     1237
Processing err 2716

mike-scott commented 6 years ago

Can you include a copy of the .config for the app yiu have loaded? As well that the buffer setyings you've already tried raising?

therealprof commented 6 years ago

.config

therealprof commented 6 years ago

I tried bumping all values I could find in the networking area without any noticeable effect but I'm happy to try specific changes you'd like me to try.

mike-scott commented 6 years ago

The key value for RX buffers would be: CONFIG_NET_BUF_RX_COUNT=10

Also, enable the following debugs: CONFIG_NET_LOG=y CONFIG_SYS_LOG_NET_LEVEL=4 CONFIG_NET_DEBUG_NET_PKT=y

Once added, you can use the following 2 commands for debugging in the console shell:

net mem: displays the status of the net packets and buffers (allocated vs. freed)
net allocs: shows a kind of stack trace for each packet/buffer indicating when it was last freed or if it's still allocated, where it was allocated.

mike-scott commented 6 years ago

I think we're actually not freeing back up buffers on an error path, so they're staying allocated.

therealprof commented 6 years ago

I'll try that tomorrow in the office.

therealprof commented 6 years ago

I've put a full capture of the debugging log as well as the commands at the end here:

https://gist.github.com/therealprof/6a7263c6633b461be79ad270636b0468

Increase of the RX buffers made no difference.

mike-scott commented 6 years ago

Wow, there are packets in your log which are running > 1KB data size. Is the eth port in promiscuous mode?

Can you run "net allocs" once you run out of buffers?

therealprof commented 6 years ago

Wow, there are packets in your log which are running > 1KB data size. Is the eth port in promiscuous mode?

No, just a regular office network.

Can you run "net allocs" once you run out of buffers?

It's at the bottom of the log.

jukkar commented 6 years ago

I think we're actually not freeing back up buffers on an error path, so they're staying allocated.

I think you are right here. Increasing the number bufs does not really help but just postpones the inevitable out-of-bufs issue.

jukkar commented 6 years ago

Are you actively sending something to zephyr device, or is it mostly some multicast / broadcast traffic? Can you check using wireshark what kind of traffic you have in the network, the log does not really reveal that with these debug options. I am just trying to pin point the code where to start debugging this.

jukkar commented 6 years ago

According to the log, you are not running out of buffers.

The "net allocs" command tells that all the buffers are freed just fine. Also at the end of the log there is a print: [net/pkt] [DBG] net_pkt_print: (0x20004358): TX 10 RX 50 RDATA 50 TDATA 10 which tells how many buffers are available and they are the same as how many were allocated at build time: net_pkt_init: (0x20006060): Allocating 50 RX (3600 bytes), 10 TX (720 bytes), 50 RX data (0 bytes) and 10 TX data (0 bytes) buffers

Also according to the log the device never fully run out of buffers [net/pkt] [DBG] net_pkt_print: (0x20004358): TX 10 RX 15 RDATA 1 TDATA 10 so there was always at least one net_buf buffer available (128 bytes) for incoming data. The RDATA or RX field would be 0 if we run out of buffers or have buffer leak.

jukkar commented 6 years ago

My bad, there was out-of-buffers situations in the log:

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer

but according to the log, the system recovered that successfully.

therealprof commented 6 years ago

Are you actively sending something to zephyr device, or is it mostly some multicast / broadcast traffic? Can you check using wireshark what kind of traffic you have in the network, the log does not really reveal that with these debug options. I am just trying to pin point the code where to start debugging this.

Well, yes, I'm trying to use LwM2M but most of the traffic is actually broadcasts in the network.

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer but according to the log, the system recovered that successfully.

No it does not. The application (lwm2m-client) doesn't receive any traffic any more after one of those messages. Actually earlier, but when this message appears you know it has happened and it's time for a reboot.

I am very convinced it is due to all the traffic in the network because the same device will run much longer in a different environment, e.g. my home. At the office it sometimes even fails to register at the lwm2m server but at the lastest 3 minutes after boot it is not communicating anymore.

tbursztyka commented 6 years ago

Hum RX are going lower and lower, and worse RDATA is a lot near 0 (thus why sometimes it is unable to get enough frags for the received packet). Looks like the host has not enough time to consume received packets. Sounds like RX queue has a too low priority maybe. @jukkar how can this be tweaked now? (we used to have one rx thread but now it's slightly different as we have rx queue(s). )

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

tbursztyka commented 6 years ago

Note that, even if rx queue priority could fix your issue, there is no buffer bloat mitigation logic in Zephyr's net stack. So it is still possible to flood it (knowing that it runs on a board with a limited amount of RAM with a bearer like Ethernet)

therealprof commented 6 years ago

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

So the warnings disappear but the data flow still stops. Maybe there's some other foul play here.

mike-scott commented 6 years ago

@tbursztyka I guess the core problem is that the ethernet driver can't know if the packet is wanted or unwanted until the fully formed(allocated) packet+buffers reaches the ip management / connection code ... which could certainly lead to flooding as you mentioned if the RX queue can't process the packets fast enough to free back up the net_pkt/net_bufs for more incoming packets.

Examining the ethernet case: could we send the "destination MAC" from the very first buffer to a new L2 (ethernet specific) function which could determine if the entire frame could be tossed away instead of allocating any more buffers.

The downside is that each ethernet driver would need to make that call prior to calling net_recv_data.

tbursztyka commented 6 years ago

@mike-scott What you describe is the promiscuous mode. Afaik most controllers, if not all in fact, filters packet on the dst mac: if it's not for them, they drop it. That said, if the feature is not enabled or badly configured on driver level, then that won't work. Could be the case here, I don't know.

@therealprof The frag size tweaking is just to use memory in a bit better way (less frags so less overhead of net_buf struct), but there is still the issue of rx packet consumption.

mike-scott commented 6 years ago

@tbursztyka 1KB broadcast packets seem large. I guess we would need some debugging to make sure these packets are valid.

jukkar commented 6 years ago

No it does not. The application (lwm2m-client) doesn't receive any traffic any more after one of those messages. Actually earlier, but when this message appears you know it has happened and it's time for a reboot.

I meant to say that from memory point of view the system recovers this low memory issue just fine (according to provided logs). It is totally different matter what happens in other parts of the zephyr (be it in IP stack or application).

jukkar commented 6 years ago

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

I would argue that the system should work better with 128 byte buffer than with 512. This depends on the network traffic of course but with smaller buffers the network packets are probably better fit into net_buf and we do not run out of buffers easier.

Note that we have a samples/net/throughput_server sample app for measuring UDP traffic throughput, you could try that one too and see what happens with your board. See the readme file in that sample for usage details.

jukkar commented 6 years ago

Note that, even if rx queue priority could fix your issue, there is no buffer bloat mitigation logic in Zephyr's net stack. So it is still possible to flood it (knowing that it runs on a board with a limited amount of RAM with a bearer like Ethernet)

We cannot prevent flooding but we should recover from it of course. For testing this one can use the throughput_server app I mentioned earlier.

jukkar commented 6 years ago

In order to get some idea what goes wrong and in what part of the system, could you enable some debugging options and post the log somewhere. We could try first with

CONFIG_NET_LOG_GLOBAL=y
CONFIG_SYS_LOG_LWM2M_LEVEL=4
CONFIG_SYS_LOG_NET_LEVEL=4

That setting will print lot of stuff and will require lot of flash so we might need to tweak that a bit but we can start with this one.

therealprof commented 6 years ago

@jukkar

No worries,

Memory region         Used Size  Region Size  %age Used
           FLASH:       98244 B         2 MB      4.68%
            SRAM:       41152 B       256 KB     15.70%
        IDT_LIST:         244 B         2 KB     11.91%

However I can only test next Wednesday (if I don't forget), won't be at the office before.

jukkar commented 6 years ago

Was that for lwm2m_client? When I compile for nucleo_f429zi I get

Memory region         Used Size  Region Size  %age Used
           FLASH:      163492 B         2 MB      7.80%
            SRAM:       46336 B       256 KB     17.68%
        IDT_LIST:         244 B         2 KB     11.91%

Only thing I added to prj.conf file in that sample was the CONFIG_NET_LOG_GLOBAL=y

therealprof commented 6 years ago

@jukkar Yes, but I had to disable IPv6. But there's still plenty of room.

jukkar commented 6 years ago

The issue with IPv6 sounds weird. I would be interested in getting more information and logs about this so can you prepare also a log with IPv6 enabled and also without IPv6.

therealprof commented 6 years ago

@jukkar #6307

jukkar commented 6 years ago

@jukkar #6307

Ah, it was that compiler thingy, never mind then :)

pfalcon commented 6 years ago

and it very soon ends up receiving more packets than it can handle but the main problem is that even if the load lightens the network application doesn't receive any packets anymore and renders the system useless.

This pretty much can be an (assumed) description of what happens under the surface in an earlier issue of https://github.com/zephyrproject-rtos/zephyr/issues/3132 . Except that in my case, "more packets than it can handle" is very unlikely to be involved, just some load applied ;-).

nashif commented 6 years ago

@pfalcon @jukkar What is the status of this?

pfalcon commented 6 years ago

What is the status of this?

Why do you think https://github.com/zephyrproject-rtos/zephyr/issues/8168 is there? "Not debuggable currently."

jukkar commented 6 years ago

We do not know the root cause yet, needs more investigation. Logging enhancements would definitely help with this.

nashif commented 6 years ago

what is the status of this issue? @jukkar?

jukkar commented 6 years ago

Found issues in echo-server/echo-client that those did not recover properly from buffer exhaust. I will send a fix for this issue. This issue was created probably for some other application so it does not help here. I am also investigating to replace all K_FOREVER buffer allocations in IP stack to have a timeout.

jukkar commented 6 years ago

I have been testing these "hang" issues with #9118 and it seems to work ok, meaning I was not able to reproduce the overload issue after applying that PR. @therealprof are you able to test that pr if it works for you?

therealprof commented 6 years ago

I'm on vacation. Feel free to declare it fixed and I'll follow up with a new ticket if I still see problems.

pfalcon commented 6 years ago

It's not proven that https://github.com/zephyrproject-rtos/zephyr/commit/7c7cfdda50b066f4d9d1d8b5d2143a0cf19fda5e fixes this, reopening.

pfalcon commented 6 years ago

@therealprof: Are you back from vacation now? Can you please retest this to see if there're any improvements? (With my usual tests, I don't see any so far, so we should keep pulling on this thing.)

therealprof commented 6 years ago

@pfalcon Yes I am. Unfortunately I don't have the Nucleo-F429ZI here with me in the "hostile" environment and the FRDM-K64F build is broken, too. :(

pfalcon commented 6 years ago

and the FRDM-K64F build is broken, too. :(

Can you elaborate on this?

therealprof commented 6 years ago

@pfalcon See #9558.

pfalcon commented 6 years ago

@therealprof, Btw, if you still experience issue, I recommend applying and enabling (CONFIG_SYS_LOG_DBG_ERR=y) my cute patch: https://github.com/zephyrproject-rtos/zephyr/pull/8769 , and see if there're any correlation with the state it gets into and logging printed.

therealprof commented 6 years ago

@pfalcon I tried on my FRDM_K64F in the office today (aka hostile networking environment) and it was running extremely unstable (lots of buffer problems and faults: "Imprecise data bus error") though I'm not convinced it is the same issue as before. I guess I need to crank up the logging and try again on Monday.

pfalcon commented 6 years ago

@therealprof

it was running extremely unstable (lots of buffer problems and faults: "Imprecise data bus error")

Can you please give more details on how to reproduce it? I assume it's still lwm2m_client sample. I have neither busy Ethernet framework nor even LwM2M server here, and I don't get faults running from today's master. Perhaps, the faults are exactly dependent on the network traffic you have? Any clarifications/step to reproduce would be appreciated.

***** Booting Zephyr OS v1.12.0-1654-g3b80998ff2 *****
Zephyr Shell, Zephyr version: 1.13.0-rc1
Type 'help' for a list of available commands
shell> [lib/lwm2m_engine] [DBG] lwm2m_engine_init: LWM2M engine thread started
[lwm2m_obj_security] [DBG] security_create: Create LWM2M security instance: 0
[lwm2m_obj_server] [DBG] server_create: Create LWM2M server instance: 0
[lwm2m_obj_device] [DBG] device_create: Create LWM2M device instance: 0
[lwm2m_obj_firmware] [DBG] firmware_create: Create LWM2M firmware instance: 0
[lwm2m-client] [INF] main: Run LWM2M client
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/9, value:0x20007207, len:1
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/10, value:0x20007204, len:4
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/20, value:0x20007207, len:1
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/21, value:0x20007204, len:4
[lib/lwm2m_engine] [DBG] lwm2m_engine_create_obj_inst: path:3303/0
[ipso_temp_sensor] [DBG] temp_sensor_create: Create IPSO Temperature Sensor instance: 0
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3303/0/5700, value:0x20007220, len:8
[lib/lwm2m_engine] [DBG] lwm2m_engine_create_obj_inst: path:3311/0
[ipso_light_control] [DBG] light_control_create: Create IPSO Light Control instance: 0
[lib/lwm2m_rd_client] [INF] lwm2m_rd_client_start: LWM2M Client: frdm_k64f
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 0
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954

shell> [lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954

therealprof commented 6 years ago

@pfalcon The new crashes I've been seeing seem to be mostly unrelated to general network traffic, at least a Nucleo board will continue to run for a whole day just fine. However the FRDM will reliably crash with all the time and I can easily speed it up with LwM2M operations:

[net/core] [DBG] net_rx: (0x200021c4): Received pkt 0x20004a60 len 60
[net/ethernet] [DBG] ethernet_recv: (0x200021c4): iface 0x20009240 src 42:DA:4D:B4:4D:49 dst 00:04:9F:44:94:AA type 0x800 len 60
[net/ipv4] [DBG] net_ipv4_process_pkt: (0x200021c4): IPv4 packet received from 212.18.24.70 to 10.0.0.31
[net/conn] [DBG] net_conn_input: (0x200021c4): Check UDP listener for pkt 0x20004a60 src port 5683 dst port 65052 family 2 chksum 0xb305 len 36
[net/conn] [DBG] net_conn_input: (0x200021c4): [1] match found cb 0x00006eb9 ud 0x20000000 rank 0x27
[net/ctx] [DBG] net_context_packet_received: (0x200021c4): Set appdata 0x20005222 to len 8 (total 36)
[lib/lwm2m_engine] [DBG] lwm2m_udp_receive: checking for reply from [212.18.24.70]
[lib/lwm2m_engine] [DBG] handle_request: No accept option given. Assume OMA TLV.
[net/ethernet] [DBG] net_eth_fill_header: (0x200021c4): iface 0x20009240 src 00:04:9F:44:94:AA dst 42:DA:4D:B4:4D:49 type 0x800 len 180
[net/ethernet] [DBG] ethernet_send: (0x200021c4): Found ARP entry, sending pkt 0x20004dc0 to iface 0x20009240
***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x200021c4
Faulting instruction address = 0x15210
Fatal fault in thread 0x200021c4! Aborting.
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004aa8 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004af0 len 60
***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x20002610
Faulting instruction address = 0x5586
Fatal fault in thread 0x20002610! Aborting.
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004a18 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200049d0 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004988 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004940 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200048f8 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200048b0 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004868 len 250

Note that there's a few seconds delay before the last aborting then the net_recv_data output.

0x15210 is sys_dlist_remove 0x5586 is in memcpy

It is a bit nasty to debug in normal mode, I'm only getting:

Breakpoint 1, _BusFault (fromHardFault=0, esf=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:291
291 static int _BusFault(NANO_ESF *esf, int fromHardFault)
(gdb) bt
#0  _BusFault (fromHardFault=0, esf=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:291
#1  0x00004e22 in _FaultHandle (esf=0x20006bb8 <rx_stack+800>, fault=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:559
#2  _Fault (esf=0x20006bb8 <rx_stack+800>, exc_return=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:719
#3  0x00004f7a in __usage_fault () at /Users/egger/oss/zephyr/arch/arm/core/fault_s.S:143
#4  <signal handler called>
#5  0xaaaaaaaa in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

And of course once I set options for "better debugging experience" and fire it up in the debugger it chucks along just fine.

therealprof commented 6 years ago

I'm running this on 3b80998ff2204c9edecdae88a4849a6b26e7d06a, BTW.

pfalcon commented 6 years ago

I'm running 3b80998 too, as can be seen from log above. But for me, memcpy is at 0x5836. I wonder if going for that "reproducible build" stuff would be actually useful for us. You run SDK 0.9.3, don't you?

pfalcon commented 6 years ago

(Just had my laptop battery crash, so will write many small msgs.)

I also see that faults happen with real LwM2M interaction, I don't have setup to reproduce it. But such kinds of faults look suspiciously stack-related, so I'd suggest bumping stack size, and maybe not just main, but other too.

zephyrproject-rtos / zephyr

IP stack can't recover from a packet overload #7571