yrutschle / sslh

Applicative Protocol Multiplexer (e.g. share SSH and HTTPS on the same port)
https://www.rutschle.net/tech/sslh/README.html
GNU General Public License v2.0

SSLH errors/breaks on low memory #72

Closed: kecksk0enig closed this issue 3 years ago

kecksk0enig commented 8 years ago

Hi, I am running sslh on a (small) VPS that limits the number of processes, so I went the sslh-select path. However, after running for a while sslh silently quits.

I ran sslh-select in the foreground to see what exactly happens. Here is the log:

forward to http failed:connect: No buffer space available
forward to http failed:connect: No buffer space available
[...]
forward to ssl failed:connect: No buffer space available
forward to http failed:connect: No buffer space available
forward to http failed:connect: No buffer space available
[...]
*** Error in `/usr/sbin/sslh': munmap_chunk(): invalid pointer: 0x00000000010b0818 ***
Aborted

It looks like the server ran out of memory (as said, it is a small VPS with ~2 GB RAM) that also runs a mail server (where clamd is hogging the memory).

I can live with the fact that some requests are not handled, but sslh should stay alive so I don't have to restart it from time to time.
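For illustration, the behaviour requested here (drop the one connection whose backend connect() failed, but keep the daemon alive) amounts to treating a failed forward as a per-connection error rather than a fatal one. A minimal hypothetical sketch, not sslh's actual code (handle_failed_forward() is an invented name), could look like this:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not sslh's actual code: if connecting to the backend
 * fails (for example with ENOBUFS, "No buffer space available"), log the
 * error, close only this client's socket and return, so that one request is
 * lost but the daemon itself keeps running. */
static void handle_failed_forward(int client_fd, const char *proto, int err)
{
    fprintf(stderr, "forward to %s failed:connect: %s\n", proto, strerror(err));
    close(client_fd);   /* drop just this connection; never abort the process */
}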

Thanks in advance

yrutschle commented 8 years ago

On Fri, Dec 11, 2015 at 04:13:42AM -0800, kecksk0enig wrote:

It looks like the server ran out of memory (as said, it is a small VPS with ~2 GB RAM) that also runs a mail server (where clamd is hogging the memory).

I can live with the fact that some requests are not handled, but sslh should stay alive so I don't have to restart it from time to time.

Congratulations, that's the first crash reported in... well, a long time.

Can you run it under valgrind and report the trace?

Something like:

valgrind --leak-check=full ./sslh-select [... usual sslh options ...]

Cheers, Y.

kecksk0enig commented 8 years ago

Just to give a little heads-up: I have been running sslh under valgrind for 24 hours now and it is still running. I'll write back after the next crash.

kecksk0enig commented 8 years ago

Finally it crashed again:

==22768== Memcheck, a memory error detector
==22768== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==22768== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==22768== Command: /usr/sbin/sslh --user sslh --listen 87.106.9.235:443 --listen 87.106.9.235:11372 --listen 87.106.9.235:11371 --ssl 127.0.0.1:443 --http 127.0.0.2:11371 --ssh 127.0.0.1:22 --pidfile /var/run/sslh/sslh.pid -f
==22768==
sslh-select head-2015-12-09 started

then numerous log entries about connections being successfully redirected,

then the crash:

forward to http failed:connect: No buffer space available
forward to http failed:connect: No buffer space available
[...]
forward to ssh failed:connect: No buffer space available
forward to ssh failed:connect: No buffer space available
[...]
Activity on fd2 while probing, impossible
state: 1
fd 11, 0 deferred
fd -1, 0 deferred
==22768==
==22768== HEAP SUMMARY:
==22768==     in use at exit: 323,895 bytes in 1,615 blocks
==22768==   total heap usage: 514,397 allocs, 512,790 frees, 299,056,108 bytes allocated
==22768==
==22768== 5,704 bytes in 1 blocks are definitely lost in loss record 9 of 12
==22768==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x404085: defer_write (common.c:200)
==22768==    by 0x40445C: fd2fd (common.c:313)
==22768==    by 0x40287A: shovel (sslh-select.c:167)
==22768==    by 0x403782: main_loop (sslh-select.c:331)
==22768==    by 0x406241: main (sslh-main.c:594)
==22768==
==22768== 57,344 bytes in 7 blocks are definitely lost in loss record 11 of 12
==22768==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x404085: defer_write (common.c:200)
==22768==    by 0x404408: fd2fd (common.c:303)
==22768==    by 0x40287A: shovel (sslh-select.c:167)
==22768==    by 0x403782: main_loop (sslh-select.c:331)
==22768==    by 0x406241: main (sslh-main.c:594)
==22768==
==22768== 248,799 bytes in 1,593 blocks are definitely lost in loss record 12 of 12
==22768==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x404085: defer_write (common.c:200)
==22768==    by 0x406A04: probe_client_protocol (probe.c:314)
==22768==    by 0x4035C3: main_loop (sslh-select.c:309)
==22768==    by 0x406241: main (sslh-main.c:594)
==22768==
==22768== LEAK SUMMARY:
==22768==    definitely lost: 311,847 bytes in 1,601 blocks
==22768==    indirectly lost: 0 bytes in 0 blocks
==22768==      possibly lost: 0 bytes in 0 blocks
==22768==    still reachable: 12,048 bytes in 14 blocks
==22768==         suppressed: 0 bytes in 0 blocks
==22768== Reachable blocks (those to which a pointer was found) are not shown.
==22768== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==22768==
==22768== For counts of detected and suppressed errors, rerun with: -v
==22768== ERROR SUMMARY: 11 errors from 4 contexts (suppressed: 0 from 0)
==22768== could not unlink /tmp/vgdb-pipe-from-vgdb-to-22768-by-root-on-???
==22768== could not unlink /tmp/vgdb-pipe-to-vgdb-from-22768-by-root-on-???
==22768== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-22768-by-root-on-???

If you need more of the log, I can send it to you. I would not like to show all the connections being made publicly in this tracker, so I would send it to you via (encrypted?) email.

yrutschle commented 8 years ago

Sorry, I forgot to ask you to add the verbose flag:

On Sat, Dec 19, 2015 at 04:10:42AM -0800, kecksk0enig wrote:

==22768== Command: /usr/sbin/sslh --user sslh --listen 87.106.9.235:443 --listen 87.106.9.235:11372 --listen 87.106.9.235:11371 --ssl 127.0.0.1:443 --http 127.0.0.2:11371 --ssh 127.0.0.1:22 --pidfile /var/run/sslh/sslh.pid -f

Can you do this again with -v at the end as well?

forward to ssh failed:connect: No buffer space available
forward to ssh failed:connect: No buffer space available
Activity on fd2 while probing, impossible
state: 1
fd 11, 0 deferred
fd -1, 0 deferred

Definitely not normal: this message ("impossible") is printed if activity occurs on a non-initialised file descriptor (

[...]

==22768== 248,799 bytes in 1,593 blocks are definitely lost in loss record 12 of 12
==22768==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22768==    by 0x404085: defer_write (common.c:200)
==22768==    by 0x406A04: probe_client_protocol (probe.c:314)
==22768==    by 0x4035C3: main_loop (sslh-select.c:309)
==22768==    by 0x406241: main (sslh-main.c:594)

There is a whole lot of data in deferred buffers, which is a bit weird (sslh stops reading incoming data when it can't write it to the corresponding outgoing socket). This suggests a bunch of incoming connections that aren't processed by the Web server, but it's hard to be sure.
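For context, the defer_write (common.c:200) frames in the valgrind traces above belong to this deferred-write mechanism. A minimal sketch of that kind of buffering, using a hypothetical deferred struct rather than sslh's real data structures, is shown below; the detail that matters under memory pressure is that the realloc() result is checked before the old pointer is replaced:

#include <stdlib.h>
#include <string.h>

/* Hypothetical per-connection queue of bytes that could not yet be written to
 * the outgoing socket; the real structure lives in sslh's common.c, this is
 * only a sketch of the mechanism described above. */
struct deferred {
    char   *data;   /* deferred bytes, grown with realloc() as needed */
    size_t  len;    /* number of bytes currently deferred */
};

/* Append bytes the outgoing socket refused so they can be flushed once it
 * becomes writable again. Returns 0 on success, -1 if the buffer could not
 * be grown (e.g. under memory pressure). */
static int defer_bytes(struct deferred *q, const char *buf, size_t n)
{
    char *p = realloc(q->data, q->len + n);
    if (p == NULL)
        return -1;          /* keep q->data valid; caller can drop the connection */
    q->data = p;
    memcpy(q->data + q->len, buf, n);
    q->len += n;
    return 0;
}

If the buffer cannot be grown, the caller can drop or stall that single connection; the existing buffer stays valid and can still be flushed or freed later, which is what keeps it from showing up as "definitely lost".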

I'll try to see if I can reproduce this with memory restrictions, lots of incoming connections and a sluggish backend.

I should warn you that investigating this is likely going to take some time :-) and I appreciate your taking time for this feedback. (It's frustrating when someone comes with a bug report, then never replies when fixes are suggested...)

Y.

kecksk0enig commented 8 years ago

Hi! Sure, since the initial bug report the time to crash has increased. First it crashed at least once a day; now it runs "stable" ;) for several days.

I think it correlates with high load (when clamd has to scan emails or similar resource/RAM-intensive tasks).

yrutschle commented 8 years ago

On Tue, Dec 22, 2015 at 06:49:58AM -0800, kecksk0enig wrote:

Sure, since the initial bug report the time to crash has increased. First it crashed at least once a day; now it runs "stable" ;) for several days.

Heisenbug: it changes when observed!

I think it correlates with high load (when clamd has to scan emails or similar resource/RAM-intensive tasks).

Probably: sslh has already been used on high load systems, and probably on small systems (I think it's in OpenWRT), but I don't think it's been tested when both conditions occur at the same time.

kecksk0enig commented 8 years ago

First: Happy New Year ;) Second: it crashed again ;) The log, however, is much too large to paste here (I don't even see a real cause, since there are no timestamps in the log).

I would like to send you the compressed log so you yourself can decide what is useful information and what is not.

Where should I send the log? Do you have a PGP key so I can do this encrypted? Thank you in advance.

Excerpt:

selecting... max_fd=148 num_probing=6
flushing deferred data to fd 6
selecting... max_fd=148 num_probing=6
flushing deferred data to fd 6
[...]
selecting... max_fd=148 num_probing=6
flushing deferred data to fd 6
processing fd1 slot 5
Activity on fd2 while probing, impossible
state: 1
fd 14, 0 deferred
fd -1, 0 deferred
==9321== 
==9321== HEAP SUMMARY:
==9321==     in use at exit: 416,264 bytes in 1,281 blocks
==9321==   total heap usage: 1,203,174 allocs, 1,201,918 frees, 690,997,118 bytes allocated
==9321== 
==9321== 20,728 bytes in 3 blocks are definitely lost in loss record 10 of 12
==9321==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x404085: defer_write (common.c:200)
==9321==    by 0x40445C: fd2fd (common.c:313)
==9321==    by 0x40287A: shovel (sslh-select.c:167)
==9321==    by 0x403782: main_loop (sslh-select.c:331)
==9321==    by 0x406241: main (sslh-main.c:594)
==9321== 
==9321== 180,224 bytes in 22 blocks are definitely lost in loss record 11 of 12
==9321==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x404085: defer_write (common.c:200)
==9321==    by 0x404408: fd2fd (common.c:303)
==9321==    by 0x40287A: shovel (sslh-select.c:167)
==9321==    by 0x403782: main_loop (sslh-select.c:331)
==9321==    by 0x406241: main (sslh-main.c:594)
==9321== 
==9321== 197,769 bytes in 1,242 blocks are definitely lost in loss record 12 of 12
==9321==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9321==    by 0x404085: defer_write (common.c:200)
==9321==    by 0x406A04: probe_client_protocol (probe.c:314)
==9321==    by 0x4035C3: main_loop (sslh-select.c:309)
==9321==    by 0x406241: main (sslh-main.c:594)
==9321== 
==9321== LEAK SUMMARY:
==9321==    definitely lost: 398,721 bytes in 1,267 blocks
==9321==    indirectly lost: 0 bytes in 0 blocks
==9321==      possibly lost: 0 bytes in 0 blocks
==9321==    still reachable: 17,543 bytes in 14 blocks
==9321==         suppressed: 0 bytes in 0 blocks
==9321== Reachable blocks (those to which a pointer was found) are not shown.
==9321== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9321== 
==9321== For counts of detected and suppressed errors, rerun with: -v
==9321== ERROR SUMMARY: 28 errors from 4 contexts (suppressed: 0 from 0)
==9321== could not unlink /tmp/vgdb-pipe-from-vgdb-to-9321-by-root-on-???
==9321== could not unlink /tmp/vgdb-pipe-to-vgdb-from-9321-by-root-on-???
==9321== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-9321-by-root-on-???

The log contains huge amounts of those probing messages and nothing else...

yrutschle commented 8 years ago

I have found at least one bug that could typically occur under heavy load and on very rare occasions, and am investigating another one (or not -- I am not sure yet).

I appear to have misplaced my personal GPG key on a PC that's "secure" (i.e. switched off and far, far away). You can send an e-mail to my work address: yves.rutschle at mdal.fr; I know where the key to that one is :-)

kecksk0enig commented 8 years ago

Hi, currently there is no GPG key for yves.rutschle at mdal.fr, only the one with fingerprint 0x6C056F7AC7934136, which I suppose is the "secured and unreachable" key.

Standing by ;)

yrutschle commented 8 years ago

Indeed -- the mdal.fr key has fingerprint 61B8 40DE 4C69 3C2D C1A8 92D1 D41C 24E1 C28B 6B0D and I've never pushed it to a keyserver (I seem to have bad luck with keyservers). I'd rather send you the public key by e-mail than spam it here, if that's OK with you (from a security point of view, I think the fingerprint here, along with my being logged in to comment, is as good as pulling my code from GitHub, but I may be wrong ;) ).

kecksk0enig commented 8 years ago

Well then, ping me at one of the UIDs of 0xCF0865D20A096F7B and I'll reply with the logfile :)

jribal commented 6 years ago

Hi, I also suffer from this issue on Debian with sslh-select version 1.18. It's a weak machine with random high loads and about 2 GB of RAM left. It happens more and more often the longer the machine is up. It has been up for 15 days and it has crashed at least once a day for the past 5 days.

Feb 14 13:10:28 system sslh-select[14199]: munmap_chunk(): invalid pointer
Feb 14 13:10:28 system systemd[1]: sslh.service: Main process exited, code=killed, status=6/ABRT
Feb 14 13:10:28 system systemd[1]: sslh.service: Unit entered failed state.
Feb 14 13:10:28 system systemd[1]: sslh.service: Failed with result 'signal'.

Should I launch sslh under valgrind to provide more info?

yrutschle commented 6 years ago

Maybe just gdb, so we know where it complains, along with sslh -v to know more about what happens. The message suggests a free() on a pointer that wasn't allocated, and I see a couple of allocations where return values aren't checked, which could explain that. I'll have to fix those anyway...
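To illustrate the suspected failure mode (an allocation whose result is not checked), here is a hypothetical sketch, not taken from sslh, contrasting the unsafe and safe realloc() patterns; with the unsafe one, a failure under memory pressure can leak the old block or lead to heap corruption, after which glibc aborts a later free() with messages such as munmap_chunk(): invalid pointer:

#include <stdlib.h>

/* Hypothetical illustration of the failure mode discussed above; this is not
 * sslh's actual code. */

/* Unsafe: the realloc() result is not checked by callers. If it fails under
 * memory pressure it returns NULL and leaves the old block untouched; a
 * caller that overwrites its pointer with the NULL return leaks the old
 * block, and a caller that keeps writing the new, larger amount of data
 * through the old pointer overruns it and corrupts heap metadata, so a later
 * free() can abort with errors such as "munmap_chunk(): invalid pointer". */
char *grow_unchecked(char *buf, size_t new_size)
{
    return realloc(buf, new_size);
}

/* Safer: check the result before replacing the old pointer; on failure the
 * caller can drop that one connection and keep the daemon running. */
int grow_checked(char **buf, size_t new_size)
{
    char *p = realloc(*buf, new_size);
    if (p == NULL)
        return -1;      /* *buf is still valid; nothing lost, nothing corrupted */
    *buf = p;
    return 0;
}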

jribal commented 6 years ago

gdb gives me this log after a crash:

Program received signal SIGSEGV, Segmentation fault.
0x00007feaf8721403 in free () from /lib/x86_64-linux-gnu/libc.so.6

yrutschle commented 6 years ago

ah, I was hoping for more of a backtrace.

This crash should create a core file (maybe you'll need to type ulimit -c unlimited before running sslh).

Can you load the core file in gdb:

gdb sslh core

and then ask gdb to give you the backtrace:

gdb> bt

thanks!

jribal commented 6 years ago

Hi, It finally happened again:

#0  0x00007fcc193e16a0 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcc193e2cf7 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcc19423f87 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcc1942a27a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007fcc1942bd2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0000555555556d89 in ?? ()
#6  0x0000555555557abf in ?? ()
#7  0x00005555555568ec in ?? ()
#8  0x00007fcc193cdf2a in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000555555556b69 in ?? ()

I also have the core file. I guess I need to run a debuggable version instead of the Debian one. I will move to a Docker version and see how it runs.

Thank you for your efforts.

yrutschle commented 6 years ago

On Wed, Feb 28, 2018 at 03:13:31PM +0000, jribal wrote:

I guess I need to run a debuggable version instead of the Debian one. I will move to a Docker version and see how it runs.

Ah, yes, a debuggable version would be better. Even better than a Docker one would be to run from the GitHub head, if you can be bothered. But it probably hasn't changed so much that the backtrace would be useless.

Y.

jribal commented 6 years ago

Hi, I'm giving my feedback.

It has run without an issue for 30 days, even with very low RAM, with this image. I can't say whether the latest versions are better or whether only Docker helps with stability.

Thanks for this project and your help.

yrutschle commented 6 years ago

Interesting -- do you know if maybe Docker automatically restarts sslh?

jribal commented 6 years ago

I forgot to check the container uptime before I rebooted my machine. It should not have restarted sslh on its own. I'll post again next time I have 30 days of uptime on the sslh Docker container.

yrutschle commented 3 years ago

sslh is generally more defensive towards memory allocation failures than "back then", so I'm closing this, assuming it's fixed. Please create a new issue if you believe it isn't.