yrutschle / sslh

Applicative Protocol Multiplexer (e.g. share SSH and HTTPS on the same port)
https://www.rutschle.net/tech/sslh/README.html
GNU General Public License v2.0
4.56k stars 368 forks source link

sslh-fork coredump #273

Closed myelsukov closed 4 years ago

myelsukov commented 4 years ago

sslh-fork coredumps on accepting ssl connection. ssh connections are coming through just fine.

> uname -a
Linux pi 5.4.51-1-ARCH #1 SMP PREEMPT Wed Jul 15 02:49:03 UTC 2020 armv7l GNU/Linux

After this it was still OK:

[2020-07-13T19:32:28-0700] [ALPM] upgraded sslh (1.20-1 -> 1.20-2)

Got broken after this:

[2020-07-23T23:35:31-0700] [ALPM] upgraded sslh (1.20-2 -> 1.21b-1)
myelsukov commented 4 years ago

Correction: 1.20-2 fails too. Downgrade to 1.20-1 fixed the problem.

yrutschle commented 4 years ago

This is the most common use case, which suggests you must be running on a rather obscure configuration (otherwise it'd have failed for many others). Also, there is only one 1.20 release, so I'm tempted to ask what distribution are you running on and what's the difference between 1.20-1 and 1.20-2?

All that being said:

Perdu commented 4 years ago

I have the exact same problem on a fresh archlinuxarm install: sslh-fork coredumps on ssl connections, but ssh connections work fine. It's not just @myelsukov

yrutschle commented 4 years ago

Ok, I still need a back trace :-)

Perdu commented 4 years ago

OK I'll try to do that. Not sure if this is useful, but here's the journalctl error:

Process 42258 (sslh) of user 63671 dumped core.

                                               Stack trace of thread 42258:
                                               #0  0x0000aaaadcbc80d0 n/a (sslh-fork + 0xb0d0)
                                               #1  0x0000aaaadcbc733c n/a (sslh-fork + 0xa33c)
                                               #2  0x0000aaaadcbc7b10 n/a (sslh-fork + 0xab10)
                                               #3  0x0000aaaadcbc1e98 n/a (sslh-fork + 0x4e98)
                                               #4  0x0000aaaadcbc21f8 n/a (sslh-fork + 0x51f8)
                                               #5  0x0000aaaadcbc16c8 n/a (sslh-fork + 0x46c8)
                                               #6  0x0000ffff992f9d90 __libc_start_main (libc.so.6 + 0x23d90)
                                               #7  0x0000aaaadcbc19bc n/a (sslh-fork + 0x49bc)
                                               #8  0x0000aaaadcbc19bc n/a (sslh-fork + 0x49bc)
yrutschle commented 4 years ago

Hmmm it looks like the binary was stripped. It needs to be compiled with -g which is the distribution default. Can you try with a freshly cloned head from GitHub?

myelsukov commented 4 years ago

As I mentioned in the ticket I am running vanilla Arch Linux 5.4.51-1-ARCH on ARMV7 (Raspberry Pi) I don't know if it is important or not but I am running sslh as transparent proxy.

All versions prior to 1.20 worked like a clockwork for years so it is something that was introduced recently.

I will try to find time to experiment with a locally built sslh if my grandkids allow that :)

myelsukov commented 4 years ago

Well, that was easy :grin:

           PID: 8660 (sslh-fork)
           UID: 979 (sslh)
           GID: 979 (sslh)
        Signal: 11 (SEGV)
     Timestamp: Sat 2020-07-25 10:22:21 PDT (19s ago)
  Command Line: /usr/bin/sslh-fork --config /etc/sslh.cfg --foreground
    Executable: /usr/bin/sslh-fork
 Control Group: /system.slice/sslh-fork.service
          Unit: sslh-fork.service
         Slice: system.slice
       Boot ID: eea67b257c5b43938cda7da51ecfb96f
    Machine ID: aa8463431c294aa9b55ffb978fc87d69
      Hostname: pi
       Storage: /var/lib/systemd/coredump/core.sslh-fork.979.eea67b257c5b43938cda7da51ecfb96f.8660.1595697741000000000000.lz4
       Message: Process 8660 (sslh-fork) of user 979 dumped core.

                Stack trace of thread 8660:
                #0  0x0000000000486a44 parse_tls_header (sslh-fork + 0xaa44)
                #1  0x0000000000485cf0 is_tls_protocol (sslh-fork + 0x9cf0)
                #2  0x00000000004863a0 probe_client_protocol (sslh-fork + 0xa3a0)
                #3  0x000000000047e8f8 start_shoveler (sslh-fork + 0x28f8)
                #4  0x000000000047ee9c main_loop (sslh-fork + 0x2e9c)
                #5  0x00000000004854fc main (sslh-fork + 0x94fc)
                #6  0x0000000076e09bb8 __libc_start_main (libc.so.6 + 0x1abb8)
yrutschle commented 4 years ago

This is odd, that's all code that hasn't changed in a while. It should help a lot if that's deterministic.

Could you run with --verbose 2 and report the trace? Also, do you know if your target is big endian or little endian?

myelsukov commented 4 years ago

It is 100% reproducible. Unfortunately adding --verbose 2 to the command line did not produce any additional output. The CPU is running in the little endian mode (according to uname -a it is armv7l

Version 1.19c which is running perfectly well produces this output:

Jul 25 14:21:44 pi sslh-fork[12329]: accepted fd 4
Jul 25 14:21:44 pi sslh-fork[12375]: **** writing deferred on fd -1
Jul 25 14:21:44 pi sslh-fork[12375]: probing for ssh
Jul 25 14:21:44 pi sslh-fork[12375]: probing for ssl
Jul 25 14:21:44 pi sslh-fork[12375]: connecting to xxxxxxxxxxxx.lan:https family 2 len 16
Jul 25 14:21:44 pi sslh-fork[12375]: ssl:connection from xxxxxxx.lan:54172 to myself:https forwarded from xxxxxxx.lan:54172 to xxxxxxxxxxxx.lan:https
Jul 25 14:21:44 pi sslh-fork[12375]: flushing deferred data to fd 3
yrutschle commented 4 years ago

I like reproducible:-)

It was mentioned earlier two versions of 1.20, one of which worked: could you try 1.20 from GitHub, and 1.21b from GitHub as well? (Trying to reduce to the change that breaks it).

The function parse_tls_header doesn't look wrong at a first glance, but it's not totally trivial either. And it does a lot of pointer arithmetics...

myelsukov commented 4 years ago

v1.20 works fine:

Jul 26 09:30:39 pi sslh-fork[15126]: Using /etc/sslh.cfg
Jul 26 09:30:39 pi sslh-fork[15126]: ssh addr: myself:ethernet-ip-s. libwrap service: (null) log_level: 1 family 2 2 [] []
Jul 26 09:30:39 pi sslh-fork[15126]: tls addr: xxxxxxxxxxxx.lan:https. libwrap service: (null) log_level: 1 family 2 2 [] []
Jul 26 09:30:39 pi sslh-fork[15126]: listening on:
Jul 26 09:30:39 pi sslh-fork[15126]:         0.0.0.0:https        []
Jul 26 09:30:39 pi sslh-fork[15126]: timeout: 2
Jul 26 09:30:39 pi sslh-fork[15126]: on-timeout: ssh
Jul 26 09:30:39 pi sslh-fork[15126]: listening to 1 addresses
Jul 26 09:30:39 pi sslh-fork[15126]: sslh-fork v1.20 started
Jul 26 09:30:43 pi sslh-fork[15135]: accepted fd 4
Jul 26 09:30:43 pi sslh-fork[15135]: accepted fd 4
Jul 26 09:30:43 pi sslh-fork[15169]: **** writing deferred on fd -1
Jul 26 09:30:43 pi sslh-fork[15169]: probing for ssh: PROBE_NEXT
Jul 26 09:30:43 pi sslh-fork[15169]: probing for tls: PROBE_MATCH
Jul 26 09:30:43 pi sslh-fork[15169]: connecting to xxxxxxxxxxxx.lan:https family 2 len 16
Jul 26 09:30:43 pi sslh-fork[15169]: tls:connection from xxxxxxx.lan:60913 to myself:https forwarded from xxxxxxx.lan:60913 to xxxxxxxxxxxx.lan:https
Jul 26 09:30:43 pi sslh-fork[15169]: flushing deferred data to fd 3
Jul 26 09:30:45 pi sslh-fork[15168]: timed out, connect to ssh
Jul 26 09:30:45 pi sslh-fork[15168]: connecting to myself:ethernet-ip-s family 2 len 16
Jul 26 09:30:45 pi sslh-fork[15168]: ssh:connection from xxxxxxx.lan:60912 to myself:https forwarded from xxxxxxx.lan:60912 to myself:ethernet-ip-s
Jul 26 09:30:45 pi sslh-fork[15168]: flushing deferred data to fd 3
Jul 26 09:30:45 pi sslh-fork[15169]: client socket closed
Jul 26 09:30:45 pi sslh-fork[15169]: connection closed down
yrutschle commented 4 years ago

I just realised that my server is ARM as well -- so that's not it.

I can't see a major difference between 1.20 and 1.21 that would explain this. I'll dig a bit deeper.

yrutschle commented 4 years ago

Ok, valgrind (which checks validity of memory accesses) seems happy on my system.

Can you:

myelsukov commented 4 years ago

Running under valgrind does not fail. I am getting this trace:

sudo valgrind --leak-check=full ./sslh-fork -v 2 -f -F /etc/sslh.cfg
==20037== Memcheck, a memory error detector
==20037== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==20037== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==20037== Command: ./sslh-fork -v 2 -f -F /etc/sslh.cfg
==20037==
sslh-fork v1.21b-1-g2c93a01-dirty started
--20037-- WARNING: unhandled arm-linux syscall: 403
--20037-- You may be able to write your own handler.
--20037-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--20037-- Nevertheless we consider this a bug.  Please report
--20037-- it at http://valgrind.org/support/bug_reports.html.
==20040== Conditional jump or move depends on uninitialised value(s)
==20040==    at 0x112A3C: parse_tls_header (tls.c:162)
==20040==    by 0x111CEF: is_tls_protocol (probe.c:214)
==20040==    by 0x11239F: probe_client_protocol (probe.c:366)
==20040==    by 0x10A8F7: start_shoveler (sslh-fork.c:98)
==20040==    by 0x10AE9B: main_loop (sslh-fork.c:200)
==20040==    by 0x1114FB: main (sslh-main.c:322)
==20040==
ssl:connection from xxxxxxx.lan:49110 to myself:https forwarded from xxxxxxx.lan:49110 to xxxxxxxxxxxx.lan:https
myelsukov commented 4 years ago

And the build log is:

[14:45 pi@pi sslh (master *=) ] > make
Makefile:74: warning: ignoring prerequisites on suffix rule definition
./genver.sh >version.h
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c sslh-conf.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c common.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c sslh-main.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c probe.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c tls.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c argtable3.c
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c sslh-fork.c
cc -Wall -g  -O0  -o sslh-fork sslh-fork.o sslh-conf.o common.o sslh-main.o probe.o tls.o argtable3.o  -lpcreposix -lconfig
#strip sslh-fork
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c sslh-select.c
cc -Wall -g  -O0  -o sslh-select sslh-select.o sslh-conf.o common.o sslh-main.o probe.o tls.o argtable3.o  -lpcreposix -lconfig
#strip sslh-select
pod2man --section=8 --release=v1.21b-1-g2c93a01-dirty --center=" " sslh.pod | gzip -9 - > sslh.8.gz
cc -Wall -g  -O0 -DENABLE_REGEX -DLIBPCRE -DLIBCONFIG -c echosrv.c
cc -Wall -g  -O0  -o echosrv echosrv.o probe.o common.o tls.o  -lpcreposix -lconfig
yrutschle commented 4 years ago
==20040== Conditional jump or move depends on uninitialised value(s)
==20040==    at 0x112A3C: parse_tls_header (tls.c:162)

Well that's interesting. I can't think of a way tls_data would be uninitialised, and I can't think of a way that would segfault, but I don't see this behaviour on my system and dammit tls_data should be initialised. By code that has changed between v1.20 and v1.21. Can you post your config file? (feel free to e-mail it if you don't want to publish it, which I would understand)

LaserEyess commented 4 years ago

Getting this myself and it's trivial to reproduce, so I will attempt to bisect it. ~Hah well no actually it isn't. I cannot reproduce it without running it in systemd.~

Here's my config

# Default Arch configuration
# You can find more examples in /usr/share/doc/sslh

verbose: false;
foreground: true;
inetd: false;
numeric: true;
transparent: true;
timeout: 2
#pidfile: "/run/sslh.pid";

listen:
(
    { host: "a.b.c.d"; port: "443"; },
    { host: "10.8.0.1"; port: "443"; },
    { host: "10.9.0.1"; port: "443"; }
);

protocols:
(
     { name: "ssh"; service: "ssh"; host: "a.b.c.d"; port: "223"; probe: "builtin"; },
     { name: "openvpn"; host: "a.b.c.d"; port: "1194"; probe: "builtin"; },
     { name: "ssl"; host: "a.b.c.d"; port: "4443"; probe: "builtin"; },
);

# vim:set ts=4 sw=4 et:

The most interesting thing I do is use transparency for ssh logging. a.b.c.d was just a sed of my real IP address.

~Edit: @yrutschle I tried to bisect but I'll need some more help beacuse I got a bogus commit, ed3cd40. Undoubtedly because I struggled to bisect with conf2struct.~

I just learned you can bisect by path and I got e7ce929

I'm getting a crash here:

Backtrace: https://0x0.st/ixcQ.txt

158        return TLS_EPROTOCOL;
159 
160    /* By now we know it's TLS. if SNI or ALPN is set, parse extensions to see if
161     * they match. Otherwise, it's a match already */
162    if (tls_data && 
163        (tls_data->match_mode.tls_match_alpn || tls_data->match_mode.tls_match_sni)) { // <- LINE 163
164        return parse_extensions(tls_data, data + pos, len);
165    } else {
166        return TLS_MATCH;
167    }

It seems that tls_data is just bogus.

myelsukov commented 4 years ago

Can you post your config file? (feel free to e-mail it if you don't want to publish it, which I would understand)

timeout: 2;
verbose: true;
foreground: true;
transparent: true;

listen:
(
    { host: "0.0.0.0"; port: "llll"; }
);

protocols:
(
     { name: "ssh"; host: "xxxxxx"; port: "xxxx"; probe: "builtin"; },
     { name: "ssl"; host: "xxxxxxxxxxxx"; port: "yyyy"; probe: "builtin"; }
);

For what it worth, I am also starting sslh as a systemd service.

yrutschle commented 4 years ago

Tls. Can you change 'ssl' to 'tls' as name for the protocol?

LaserEyess commented 4 years ago

This fixes the issue for me, but why? I don't see any indication in the man page that ssl is deprecated, in fact it looks as if tls is just an alias for it. I see https://github.com/yrutschle/sslh/commit/9ed44ee09b97632139eaa54f8a7b0db7c1dcedd1 but I did not get any deprecation warning when starting sslh.

myelsukov commented 4 years ago

Switching to tls fixed the problem for me too. Now, that's a mystery. I hope you can reproduce and debug on your side. To summarize my setup:

yrutschle commented 4 years ago

Ah. Glad we got to the root cause :-)

So there is in fact a number of issues:

Thanks to everyone who contributed to this investigation. Too often we developers are left with a 'it doesn't work! I don't know what is make!'

LaserEyess commented 4 years ago

Started by systemd didn't actually matter for me in the end. What did matter was transparent though. According to probe.c, ssl should still be valid. My bisect basically goes to the point where you switched everything to the config struct. I can't follow what's going on but it looks like ssl should be the exact same probe tls. Is there somewhere that assumptions breaks down?

yrutschle commented 4 years ago

No no: sslh-main.c:config_protocols() treats 'tls' specifically to initialise the protocol.data field. Plus, conf2struct didn't initialise so-called runtime data. So when set to 'ssl', you end up with protocol.data not initialised, so the test in tls.c:162 breaks (if (tls_data [...] assumes tls_data is either zero or valid, but here it's just garbage).

I've checked in the various changes I intended:

Perdu commented 4 years ago

Switching to "tls" also fixes the issue for me. Thanks!

LaserEyess commented 4 years ago

@yrutschle ok that makes sense, thank you for the explanation, I can see by your latest commits how that could have happened, and how the commit I found was the result of the bisect.

I just tested the latest git master with both ssl and tls and I can confirm this has fixed the issue. ssl is now does not work and produces an error, and tls works as before.