I have also noticed that when I enable ASSERTS there is a strange failure, but it is probably not related to the original issue:
~/dev/zephyrproject/cv_test_x86$ west build -t run
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: qemu32,+nx,+pae
SeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200625_115407-9426dddc0a1f-zephyr
)
Booting from ROM..*** Booting Zephyr OS build zephyr-v2.4.0-3346-gfd630d70c106 ***
*** 0.040000000 1359040 mg_start2:19779: [listening_ports] -> [80,443s]
*** 0.040000000 1359040 mg_start2:19779: [num_threads] -> [1]
*** 0.040000000 1359040 mg_start2:19779: [max_request_size] -> [4096]
*** 0.040000000 1359040 mg_start2:19779: [enable_keep_alive] -> [yes]
*** 0.040000000 1359040 mg_start2:19779: [keep_alive_timeout_ms] -> [1000]
[IMPLEMENTATION MISSING : sscanf]
ASSERTION FAIL [!arch_is_in_isr()] @ WEST_TOPDIR/zephyr/kernel/mutex.c:125
mutexes cannot be used inside ISRs
FAILED: zephyr/CMakeFiles/run
cd /home/honza/dev/zephyrproject/cv_test_x86/build && /opt/zephyr-sdk/sysroots/x86_64-pokysdk-linux/usr/bin/qemu-system-i386 -m 9 -cpu qemu32,+nx,+pae -device isa-debug-exit,iobase=0xf4,iosize=0x04 -nographic -net none -pidfile qemu.pid -chardev stdio,id=con,mux=on -serial chardev:con -mon chardev=con,mode=readline -serial unix:/tmp/slip.sock -kernel /home/honza/dev/zephyrproject/cv_test_x86/build/zephyr/zephyr.elf
ninja: build stopped: subcommand failed.
FATAL ERROR: command exited with status 1: /usr/bin/cmake --build /home/honza/dev/zephyrproject/cv_test_x86/build --target run
Thank you @xhpohanka for your bug report. Please first try, using QEMU, to drastically increase all memory buffers in prj.conf, since it is known that CivetWeb can use a large amount of memory. I'll try to take a deeper look at the weekend.
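For illustration only (not part of the original comment), the kind of prj.conf buffer increases meant here might look like the fragment below. The option names are standard Zephyr networking options, but the values are only guesses and need tuning for the application and Zephyr version in use:

```conf
# Illustrative values only - tune for the target and the Zephyr version in use
CONFIG_NET_PKT_RX_COUNT=64
CONFIG_NET_PKT_TX_COUNT=64
CONFIG_NET_BUF_RX_COUNT=128
CONFIG_NET_BUF_TX_COUNT=128
CONFIG_HEAP_MEM_POOL_SIZE=65536
```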
Hi @Nukersson,
the code I provided is extracted and simplified from my project on STM32H7; it is much easier to break things on the embedded platform than with qemu_x86, but it is still possible, so I expect that there are general issues. Sometimes I have a feeling that there can also be other issues related just to stm32h7 (I have already tried the recent patches in #30403 but they do not help here).
To break it I usually use a combination of stress testing with wrk (e.g. `wrk -d 5 -t 16 -c 1000 http://192.0.2.1`) and simply hitting ctrl+f5 in chromium with high frequency. The latter is usually (surprisingly) more effective at producing malfunctions.
I'm noticing these issues:

* network buffer leaks (the number of available `tx_bufs` decreases during the testing and never recovers)
* sometimes I get a strange state (a loop) when a `poll` call on listening sockets in civetweb's master thread still returns, so `accept_new_connection()` is called even if there are no new requests from a client
* a `getsockname()` call inside `accept_new_connection()` sometimes returns an error (in that case zephyr's internal net_context does not have a conn_handler registered)

I have played with these civetweb settings; for `SO_KEEPALIVE`, a warning is hidden in the source code.

@xhpohanka are you compiling with CONFIG_DEBUG or something else like -O3?
are you compiling with CONFIG_DEBUG or something else like -O3?
In my case `CONFIG_DEBUG` has no impact on the described issues. I use -Os, the same as on stm32...
I'm noticing these issues:
* network buffer leaks (the number of available `tx_bufs` decreases during the testing and never recovers)
* sometimes I get a strange state (a loop) when a `poll` call on listening sockets in civetweb's master thread still returns, so `accept_new_connection()` is called even if there are no new requests from a client
* a `getsockname()` call inside `accept_new_connection()` sometimes returns an error (in that case zephyr's internal net_context does not have a conn_handler registered)
These might be related to other issues in the net stack if we run out of resources, so they are not just civetweb problems. I suspect that there is a resource leak somewhere in the stack: if, for example, a net_buf allocation fails, we might bail out without releasing all the resources properly.
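To make the suspected failure mode concrete, here is a minimal C sketch of the error-path pattern being described. It is illustrative only and not taken from the Zephyr stack; the function name and flow are invented for the example:

```c
#include <zephyr.h>
#include <errno.h>
#include <net/net_ip.h>
#include <net/net_pkt.h>

/* Illustrative only: the point is that every early return after a failed
 * allocation or write must release what was already allocated, otherwise
 * the net_pkt (and its net_bufs) leak from the pool.
 */
static int send_reply(struct net_if *iface, const void *data, size_t len)
{
	struct net_pkt *pkt = net_pkt_alloc_with_buffer(iface, len, AF_INET,
							IPPROTO_TCP, K_MSEC(100));
	if (!pkt) {
		return -ENOMEM;         /* nothing allocated yet, safe to bail */
	}

	if (net_pkt_write(pkt, data, len) < 0) {
		net_pkt_unref(pkt);     /* the easy-to-miss cleanup on failure */
		return -ENOMEM;
	}

	/* ... hand the packet to the next layer, which takes ownership ... */
	return 0;
}
```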
I just sent a fix that prevented various crashes seen when I tested with the dumb_http_server_mt sample and the wrk tool. @xhpohanka could you try your scenario with #31777 and report if it helps?
The fix should help with the "network buffer leaks (the number of available `tx_bufs` decreases during the testing and never recovers)" case you described above.
Hi @jukkar, thank you for looking into this. I have done a very quick test with your patch. Now I'm not able to reproduce the issue with buffer leaks using wrk, so this is definitely progress.
I'm still able to get the loop described in my second point (poll still returns 1). I think it is related to exhausted file descriptors. Chrome and ctrl+f5 seem to be a worse stress than wrk...
UPDATE: I was able to reproduce the buffer leak, but it is much harder now. It was done again with the chrome ctrl+f5 monkey test and with the fd max number increased from 16 to 64.
uart:~$ net mem
Fragment length 128 bytes
Network buffer pools:
Address Total Avail Name
0x1c1880 256 254 RX
0x1c18c8 256 250 TX
0x1c1a6c 96 96 RX DATA (rx_bufs)
0x1c1aa8 192 173 TX DATA (tx_bufs)
uart:~$ net allocs
Network memory allocations
memory Status Pool Function alloc -> freed
0x177c38/1 used RX tcp_conn_alloc():1117
0x173438/1 used TX tcp_conn_alloc():1127
0x177bf0/1 used RX tcp_conn_alloc():1117
0x1733f0/1 used TX tcp_conn_alloc():1127
0x173090/1 used TX net_pkt_clone():1770
0x172ee0/1 used TX net_pkt_clone():1770
0x1733a8/1 used TX net_pkt_clone():1770
0x172ca0/1 used TX net_pkt_clone():1770
0x177140 free RX slip_input_byte():256 -> processing_data():139
0x16d278 free RDATA slip_input_byte():263 -> processing_data():139
I was able to reproduce the buffer leak, but it is much harder now.
Yep, I was able to see it too. Seems to be related to timings, I am investigating this.
Managed to find one nasty leak. If the application closed the socket while we were waiting for data, the data that was pushed to the application was lost. Fixed that in #31777, @xhpohanka please try the latest version if you can.
Managed to find one nasty leak. If the application closed the socket while we were waiting for data, the data that was pushed to the application was lost. Fixed that in #31777, @xhpohanka please try the latest version if you can.
Do we have some ethernet stack test suite? Can you add a test for this case?
Do we have some ethernet stack test suite? Can you add a test for this case?
We have unit tests, but I am not sure how feasible it would be to add test cases for this. I was only able to see the issue with heavy traffic and when the system ran out of network buffers, which complicates the testing.
Managed to find one nasty leak. If the application closed the socket while we were waiting for data, the data that was pushed to the application was lost. Fixed that in #31777, @xhpohanka please try the latest version if you can.
I'm still able to reproduce the leaks unfortunately. Still not with wrk, but with Chrome reloading it is no problem. Can I somehow provide more information that could help?
but with Chrome reloading it is no problem
I just noticed that firefox behaves in the same way. It is enough to hit ctrl+f5 about 20 times in quick succession to see the buffer leak with the test code that I have provided. That does not seem like extra high traffic to me.
I'm not 100% sure, but it seems `net allocs` always refers to the lost buffers in `net_pkt_clone()`:
0x1743b8/1 used TX net_pkt_clone():1765
firefox behaves in the same way
Thanks for the info. I did not try the browser route as wrk seemed to work fine, but I will try using a browser.
it seems `net allocs` always refers to the lost buffers in `net_pkt_clone()`
This was basically the same leak I fixed in the latest version of #31777; there is probably some extra code path I missed. I will investigate.
will try using a browser.
I was not able to replicate any leaks with the browser. Actually, with the browser I was not seeing any out-of-buffer scenarios; zephyr was able to reply in time and no errors were seen in the console (this with qemu_x86 and the e1000 controller). I was also using the dumb_http_server_mt sample app for this.
I was also using the dumb_http_server_mt sample app for this
I'm afraid that dumb_http_server_mt is not enough to reproduce it. That is the reason why I have provided my code with civetweb (check the first post). If you are not able to run it, please tell me what other information I can provide.
I'm aware that the issue could be in my application code (hopefully not, it is really simple) or in civetweb itself, but as civetweb is the only web server provided with zephyr we should find it.
I noticed the issues only on a page that downloads some css or js; that could be the reason why wrk does not trigger it, since wrk does not parse the server response.
@xhpohanka When you have the leak issue, what does `net conn` tell you? What are the states of the sockets?
If you see something like
TCP Context Src port Dst port Send-Seq Send-Ack MSS State
0x14e688 0x126fe0 8080 0 2016716979 0 1440 LISTEN
0x14e4f4 0x127090 8080 38032 4017182613 0 1460 LISTEN
0x14d6c0 0x1271f0 55257 34238 3129739646 1034923133 1460 SYN_RECEIVED
and the `SYN_RECEIVED` state is stuck, then you also need to apply #31806, which should time out the orphaned connection properly.
I noticed the issues only on a page that downloads some css or js
That might be related here. I will try to use your application next.
I do not see a stuck `SYN_RECEIVED` state...
uart:~$ net mem
Fragment length 128 bytes
Network buffer pools:
Address Total Avail Name
0x1c7c50 256 254 RX
0x1c7c98 256 252 TX
0x1c7e3c 96 96 RX DATA (rx_bufs)
0x1c7e78 192 183 TX DATA (tx_bufs)
uart:~$ net conn
Context Iface Flags Local Remote
[ 1] 0x140fe0 0x1c7eb4 4DU 224.0.0.251:5353 0.0.0.0:0
[ 2] 0x141074 0x1c7eb4 4ST 0.0.0.0:80 192.0.2.2:33008
[ 3] 0x141108 0x1c7eb4 4ST 0.0.0.0:443 0.0.0.0:0
TCP Context Src port Dst port Send-Seq Send-Ack MSS State
0x18382c 0x141074 80 33008 3001513734 0 1460 LISTEN
0x183698 0x141108 443 0 706889854 0 1460 LISTEN
No active connections.
uart:~$ net allocs
Network memory allocations
memory Status Pool Function alloc -> freed
0x17e038/1 used RX tcp_conn_alloc():1104
0x179838/1 used TX tcp_conn_alloc():1112
0x17dff0/1 used RX tcp_conn_alloc():1104
0x1797f0/1 used TX tcp_conn_alloc():1112
0x179400/1 used TX net_pkt_clone():1765
0x1795b0/1 used TX net_pkt_clone():1765
0x17dfa8 free RX slip_input_byte():256 -> processing_data():139
...
Update: #31806 also has no impact here...
I am probably doing something wrong as I get
cv_test/src/libc_extensions.c:279:5: error: redefinition of 'putc'
279 | int putc(int c, FILE *stream)
| ^~~~
In file included from cv_test/src/libc_extensions.c:9:
cv_test/zephyr/lib/libc/minimal/include/stdio.h:59:19: note: previous definition of 'putc' was here
59 | static inline int putc(int c, FILE *stream)
| ^~~~
cv_test/src/libc_extensions.c:284:5: error: redefinition of 'putchar'
284 | int putchar(int c)
| ^~~~~~~
In file included from cv_test/src/libc_extensions.c:9:
cv_test/zephyr/lib/libc/minimal/include/stdio.h:63:19: note: previous definition of 'putchar' was here
63 | static inline int putchar(int c)
| ^~~~~~~
make[3]: *** [CMakeFiles/app.dir/build.make:76: CMakeFiles/app.dir/src/libc_extensions.c.obj] Error 1
How do you compile this?
I tried with this:
cmake -B build -DBOARD=qemu_x86 . -DOVERLAY_CONFIG=zephyr/samples/net/sockets/echo_server/overlay-e1000.conf -DCONFIG_NET_DEBUG_NET_PKT_ALLOC=y -DCONFIG_LOG_BUFFER_SIZE=65536
make -C build -j 9 run
I see - I'm using a slightly older zephyr base (with your patches cherry-picked). There was a commit 98747abf9c3671e9b836dba4c8c8ae48c47f0c23 which added putc and putchar into the minimal libc library. Please just comment out my putc and putchar definitions in libc_extensions.c.
Update: I have just reproduced it again using your branch tcp2-release-resources-when-out-of-mem with firefox and the same build command as you.
The `mkmfs.sh` needed `xxd`, and after installing it I still get this error:
In file included from cv_test/src/mfs.c:2:
cv_test/build/zephyr/include/generated/mfs_data.h:3582:1: error: expected expression before ',' token
3582 | ,
| ^
make[3]: *** [CMakeFiles/app.dir/build.make:102: CMakeFiles/app.dir/src/mfs.c.obj] Error 1
It seems that the script requires `minify` ~from https://www.minifier.org/~. I wonder why you have an extra generator script; wouldn't the `generate_inc_file` cmake macros work?
Edit: the minify link does not seem to be the right one for the minify program, where should I get it?
Excuse me, it seems that I have not simplified my example enough... The reason for the last error is probably the missing minify (at least I get the same error in that case). Minify is definitely not necessary to run this example. I have pushed a new version to my git; I hope it will work now.
I wonder why you have extra generator script, wouldn't the generate_inc_file cmake macros work?
It probably would, but it was written by my colleague, who is not familiar with the zephyr build system.
I managed to see the issue (used firefox, and just kept ctrl+f5 pressed for several seconds). What looks suspicious is that the net-shell shows this loop of debug messages:
....
[INTERNAL ERROR]: accept_new_connection @ 18117
accept_new_connection: getsockname() failed: EINVAL
[00:12:32.120,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13ee84, fd=7
uart:~$ [INTERNAL ERROR]: accept_new_connection @ 18117
accept_new_connection: getsockname() failed: EINVAL
[00:12:32.140,000] <dbg> net_sock.zsock_accept_ctx: (): accept: ctx=0x13f040, fd=11
[00:12:32.140,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13f040, fd=11
[00:12:33.140,000] <dbg> net_sock.zsock_accept_ctx: (): accept: ctx=0x13ee84, fd=8
[00:12:33.140,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13ee84, fd=8
uart:~$ [INTERNAL ERROR]: accept_new_connection @ 18117
accept_new_connection: getsockname() failed: EINVAL
[INTERNAL ERROR]: accept_new_connection @ 18117
accept_new_connection: getsockname() failed: EINVAL
[INTERNAL ERROR]: accept_new_connection @ 18117
accept_new_connection: getsockname() failed: EINVAL
[00:12:33.160,000] <dbg> net_sock.zsock_accept_ctx: (): accept: ctx=0x13f040, fd=12
[00:12:33.160,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13f040, fd=12
[00:12:34.140,000] <dbg> net_sock.zsock_accept_ctx: (): accept: ctx=0x13ee84, fd=9
[00:12:34.140,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13ee84, fd=9
[00:12:34.160,000] <dbg> net_sock.zsock_accept_ctx: (): accept: ctx=0x13f040, fd=13
[00:12:34.160,000] <dbg> net_sock.z_impl_zsock_getsockname: (): getsockname: ctx=0x13f040, fd=13
....
I do not see anyone reading the data away, thus the net_pkts are just pending in the receive queue even if the connections died a long time ago. I did not investigate civetweb that deeply, but after the accept it really needs to call recv or similar, otherwise the packets are just hanging there. The net-shell shows that TX packets are running out; I don't know what is going on there. It might be that we are using the wrong net_pkt pool (TX instead of RX). Anyway, it looks like there is something fishy with civetweb or the application, and recv is not called.
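As a concrete illustration of the point above (not the actual civetweb code), the expected pattern on the application side looks roughly like this sketch; the function name and buffer size are invented for the example:

```c
#include <zephyr.h>
#include <net/socket.h>

/* Illustrative only: every accepted socket must be read (or closed) promptly,
 * otherwise received net_pkts stay queued on the socket's receive queue and
 * the buffer pool slowly drains. Error handling is reduced to the essentials.
 */
static void serve_one(int listen_fd)
{
	char buf[256];
	int client = zsock_accept(listen_fd, NULL, NULL);

	if (client < 0) {
		return;                 /* e.g. out of fds: do not spin on accept */
	}

	ssize_t len;

	/* Drain the request so queued packets are released back to the pool */
	while ((len = zsock_recv(client, buf, sizeof(buf), 0)) > 0) {
		/* ... parse the request and send a reply ... */
	}

	zsock_close(client);            /* always release the fd and its context */
}
```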
Hi @jukkar,
the suspicious state from the previous comment appears when the app runs out of free file descriptors. Civetweb is probably handling this wrongly. I cannot reproduce it when I increase `CONFIG_POSIX_MAX_FDS` to some higher number (48).
I noticed that your fixes were merged to master, so I rechecked my app against the current master. It is much better; however, I was still able to see a leak. It takes a really big effort with ctrl+f5, but it is sometimes there. Besides the number of fds, I have also increased the number of civetweb worker threads to 4.
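For readers trying to reproduce this setup, the two changes mentioned above roughly translate to the sketch below. It is illustrative, not the exact cv_test code: the fd limit is a prj.conf option, and the worker-thread count is a civetweb option passed to `mg_start()`:

```c
#include <string.h>
#include "civetweb.h"

/* prj.conf side (illustrative): CONFIG_POSIX_MAX_FDS=48 */

static struct mg_context *start_server(void)
{
	const char *options[] = {
		"listening_ports", "80",
		"num_threads", "4",   /* more workers so accepted fds are drained faster */
		NULL
	};
	struct mg_callbacks callbacks;

	memset(&callbacks, 0, sizeof(callbacks));
	return mg_start(&callbacks, NULL, options);
}
```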
I was still able to see a leak
Yep, I saw that too but it does not look like it comes from tcp, as I wrote in my earlier comment. After TCP has received the data, it is put into the socket receive queue. If no one is reading that queue, we basically have a "leak". And it looked like that from my limited testing.
I saw that too but it does not look like it comes from tcp
Can you give me a hint about the best way to trace it, please? I'd really like to have this reliable...
Can you give me a hint about the best way to trace it, please? I'd really like to have this reliable...
Usually I just place debug prints in relevant places in the code. For the accept loop log above, I needed to disable civetweb's debugging prints because there was just too much data printed. If you are using the native_posix board, then debugging with gdb is of course quite convenient. I did not verify this, but because there were just a lot of prints about accept (in a loop) and no prints about data being read, it looked like the receive queue in subsys/net/lib/sockets/sockets.c was not being read. As the receive queue is shared with the accept queue, it is also possible that there is some bug in sockets.c related to the accept/read functionality that shows up when we have these net_buf/pkt allocation issues. But after a quick review, I did not immediately spot where such an issue would be in sockets.c.
@xhpohanka Sorry, I do not currently have the capacity to work on this topic. How is it going? Did the provided changes fix your problem?
@jukkar Thanks for your very fast support!
Hello @Nukersson,
the fixes by @jukkar definitely helped. I still have a feeling that it is not 100%, but so far I have fixed it just by providing a bigger number of fds and praying that our project does not fail in production...
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
Bump! Still happening on stm32h7xx. I just had a rough few weeks with the same exact problem. A dummy hold-F5-key test with Firefox over HTTPS (not HTTP) manages to break things in a couple of seconds. There seems to be a race condition in the tcp stack regarding the RX net_pkt leaks between the connection and listen threads when the stack is running out of free net_pkts. It happens with civetweb and with a custom https server as well. What fixed the issue for me was setting these threads to cooperative priority, which isn't nice, but at least there are no RX net_pkt leaks.
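For context, a minimal sketch of the workaround being described, assuming the server thread is created on the Zephyr side; the thread name, stack size and priority value are illustrative, not the poster's exact change:

```c
#include <zephyr.h>

#define SERVER_STACK_SIZE 2048

static void server_thread(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1);
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	/* listen/accept/recv loop lives here */
}

/* K_PRIO_COOP() yields a negative (cooperative) priority, so the thread is
 * not preempted by other threads while it is handling net_pkts.
 */
K_THREAD_DEFINE(server_tid, SERVER_STACK_SIZE, server_thread,
		NULL, NULL, NULL, K_PRIO_COOP(7), 0, 0);
```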
Finally I have found a solution to the leak. Please have a look at my pull request #42328. I hope the pull request explains it all. All in all it was a simple fix, however damn hard to find :)
I am new to the github contribution system, so excuse me if I am doing something wrong. @jukkar I have seen you under the network related issues, that's why I am mentioning you.
Thank you for the cooperation.
Hi @neja20, thanks for looking into this. I will try your patch later this week; from the description it looks reasonable to me, but I'm not familiar enough with the stack to comment on it more deeply.
Regarding the PR, you definitely need to keep the zephyr coding style (tab indenting, width 8, etc.). You can check if your commit is ok with the checkpatch.pl script; it will help you find contribution issues before pushing to github. I usually use it like this: `scripts/checkpatch.pl -g HEAD`, which checks your latest commit.
Describe the bug
Civetweb on zephyr can easily hang when there are no free file descriptors. I have noticed it on a page that downloads some styles, javascript etc., not on a simple text page.
To Reproduce
I have prepared a test case; it works in qemu_x86: https://github.com/xhpohanka/cv_test
It can be reproduced simply by accessing a page (http://192.0.2.1) with a browser and doing several fast refreshes (ctrl+F5). When the civetweb threads have the same priority, the web server will hang because it ends up in a loop in the master thread where `accept()` still returns an error. If we add a sleep there (see https://github.com/xhpohanka/cv_test/blob/master/civetweb_dbg.patch) it will continue to run, but it leads to leaks, as can be seen in the following logs (check the `net mem` and `net allocs` outputs).
I do not know yet if the problem is in Zephyr or in Civetweb, where I have already reported it (civetweb/civetweb#962).
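The sleep mentioned above is essentially a back-off in the accept loop; a rough sketch of the idea (not the actual civetweb_dbg.patch contents, and with an invented helper name) could look like this:

```c
#include <zephyr.h>
#include <net/socket.h>

/* Illustrative only: if accept() keeps failing (for example because no file
 * descriptors are free), back off briefly instead of spinning, so the worker
 * threads get a chance to close sockets and release fds.
 */
static int accept_with_backoff(int listen_fd)
{
	for (;;) {
		int fd = zsock_accept(listen_fd, NULL, NULL);

		if (fd >= 0) {
			return fd;
		}
		k_sleep(K_MSEC(100));   /* illustrative back-off interval */
	}
}
```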
Log with normal operation
Log with leaks