Closed kaechele closed 4 years ago
Debian oldstable (stretch) currently only offers kernels that are affected by this bug; I submitted a bug report about updating the backport kernels (it's already a 4.19 kernel, but not new enough).
The Debian stable (buster) kernel is fine (4.19.98). On Ubuntu 18.04 (bionic), install linux-generic-hwe-18.04 (at least this changelog says that should suffice).
Could you say something about what happens with old kernels? I updated to version 19.07.4 with a 4.14.180 or 4.19.X kernel. I don't see any obvious issues with that kernel and version while running tunneldigger. I will investigate further.
Clients were unable to reliably connect with broken kernels, so we saw many connection timeouts or other disconnects in the logs. Also see https://github.com/wlanslovenija/tunneldigger/issues/129. You should also see warnings specifically pointing out that the kernel is likely buggy.
Maybe Ubuntu backported the problematic patches, who knows. (I assume you are using Ubuntu? You didn't state the distro you are using.)
OpenWrt. Thanks, I will have a look at the log.
This issue is about the tunneldigger broker. Are you really running that server-side component on OpenWrt?
No. :O Sorry, then everything is fine. :D
Would using SO_REUSEADDR instead of SO_REUSEPORT be an option? At least using a short test program, kernel 4.19 doesn't seem to show this bug with SO_REUSEADDR (I have not checked older kernels).
While implementing L2TP support for fastd (still work in progress), I noticed another advantage of SO_REUSEADDR: It can be set after bind() of the first socket, while SO_REUSEPORT needs to be set before bind(), which may accidentally allow two processes of the same user to bind to the same port.
With SO_REUSEADDR this can be prevented: Let a process bind its first socket without SO_REUSEADDR; this will fail if the port is already bound by another process. Then set SO_REUSEADDR on the first socket. On subsequent sockets, set SO_REUSEADDR before bind(), so they are allowed to use the same port as the first socket.
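The bind sequence described above could be sketched roughly as follows. This is a minimal Python illustration for UDP sockets that assumes Linux semantics for SO_REUSEADDR; it is not code taken from fastd or Tunneldigger, and the helper names bind_first and bind_additional are made up for this example.

```python
import socket


def bind_first(addr, port):
    """Bind the first socket exclusively, then allow later sharing."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # SO_REUSEADDR is not set yet, so bind() fails with EADDRINUSE
        # if any other socket already holds this address and port.
        sock.bind((addr, port))
    except OSError:
        sock.close()
        raise
    # The exclusive bind succeeded; from now on permit sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


def bind_additional(addr, port):
    """Bind a further socket to a port already held via bind_first()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # For subsequent sockets the option has to be set before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind((addr, port))
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    first = bind_first("127.0.0.1", 0)      # port 0: let the kernel pick
    port = first.getsockname()[1]
    second = bind_additional("127.0.0.1", port)
    print("both sockets share UDP port", port)
    first.close()
    second.close()
```

A second process calling bind_first() on the same port gets an error instead of silently sharing it, which is the protection that setting SO_REUSEPORT up front does not give for processes of the same user.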
Would using SO_REUSEADDR instead of SO_REUSEPORT be an option?
I have to admit I am out of my league here; the differences between these flags are beyond my experience in this space. @kaechele did the implementation with SO_REUSEPORT, he might be able to comment. Other than that, if someone writes a PR that switches to SO_REUSEADDR, I'd be willing to test that on our servers and merge it if it works.
I initially implemented this using SO_REUSEPORT as my research suggested this to be best practice from a security standpoint.
Your way of utilizing SO_REUSEADDR looks like a smart way to avoid double-binding a port already in use by the same user but for a different application.
Correct me if I'm wrong here, but it looks like you trade the same-user bind protection for protection against user error in this case.
It seems like an edge-case scenario that some other (malicious) user on the same machine would try to abuse a reused port to intercept or alter traffic. Given that L2TPv3 is not encrypted or authenticated anyway, this is a sensible trade-off in my eyes.
I don't know if I have an immediate need to switch the current implementation over to SO_REUSEADDR, but I'm sure it would be a quick thing to do anyway.
In any case I'm looking forward to playing with fastd's implementation as I love the idea of flexibility in selecting L2TP as an option if I require speed over security.
Correct me if I'm wrong here but it looks like you trade the same-user bind protection for protection of user error in this case. It seems like an edge case scenario that some other (malicious) user on the same machine would try to abuse a reused port to intercept or alter traffic.
This is correct. If running on the same machine as untrusted users, only using low ports for L2TP would mitigate the issue.
So, we (Freifunk Berlin) have been trying to use the NAT-removed version and have run into some strange issues. It seems that if there is an in-between router which is also doing NAT (perhaps with an older kernel), the post-NAT-removal version doesn't work and the tunnels time out. It's strange because it works for some people and not for others, and the only difference we can find is the router in between. For example, it works just fine with a recent OpenWrt image, but with a Fritz 7590 with firmware 7.50 it doesn't.
We have reverted to 7c467e68021526b8631e8a53a9022aa223
Sounds like an issue unrelated to this kernel bug, possibly in the NAT implementation of the faulty routers. In any case, it would probably be best to open a separate issue and attach some debugging information so the issue can be looked into. Good debugging info would be excerpts of the conntrack table from affected routers or maybe even packet captures.
Newer versions of the Tunneldigger broker use SO_REUSEPORT to process multiple tunnels on one single port. In the past Tunneldigger used a NAT-based workaround to make this work. To simplify the code and remove unnecessary dependencies, this workaround was removed. Unfortunately, there are several kernel bugs that prevent SO_REUSEPORT for UDP sockets from working properly, and they are only fixed in fairly recent kernels. This means that the change, in conjunction with the bugs, has some peculiar implications for which kernel versions can be used for brokers. (Tunneldigger clients are unaffected by all of this.)
Kernel versions 5.10.152 and newer exhibit the correct behaviour and should work.
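To illustrate what "multiple tunnels on one single port" relies on, here is a minimal Python sketch of several UDP sockets sharing one port via SO_REUSEPORT on Linux. It is not the broker's actual code: the helper name open_tunnel_socket is hypothetical, and the idea of one connected socket per client is an assumption made for the example. Whether the kernel then delivers each client's datagrams to the matching connected socket is exactly the part that depends on the kernel fixes discussed here, so the sketch only shows the setup.

```python
import socket


def open_tunnel_socket(addr, port, peer=None):
    """Open a UDP socket on a shared port (hypothetical helper).

    SO_REUSEPORT has to be set before bind(); on Linux all sockets
    sharing the port must belong to the same effective user.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((addr, port))
        if peer is not None:
            # Restrict this socket to a single remote client.
            sock.connect(peer)
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    # Unconnected socket that would receive traffic from unknown clients.
    listener = open_tunnel_socket("127.0.0.1", 0)
    port = listener.getsockname()[1]

    # Stand-in for a remote client, only needed to have a peer address.
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.bind(("127.0.0.1", 0))

    # Per-client socket on the same local port as the listener.
    tunnel = open_tunnel_socket("127.0.0.1", port, client.getsockname())
    print("listener and tunnel socket share UDP port", port)

    for s in (listener, client, tunnel):
        s.close()
```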
You have probably landed here because you still use an older Linux distribution or haven't updated to a working kernel version. If you are experiencing this issue you have two options:
Kernel fixes
For the curious among you, the two fixes that are needed are:
69421bf98482d089e50799f45e48b25ce4a8d154
below)