Running iodine through a caching resolver on the client seems like it would only cause problems, yes. The simplest workaround is to specify a DNS server for the client to use (instead of falling back to the system default).
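For example (the resolver address below is just a placeholder -- any reachable DNS server should do):

```
# Send the tunnel's queries to an explicitly chosen DNS server
# instead of the system default resolver:
iodine -r -P foo 198.51.100.53 evl.edu
```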
Thanks for the reply! However, to clarify:
On Sat, Mar 26, 2022 at 07:45:24AM -0700, Erik Ekman wrote:
> Running iodine through a caching resolver on the client seems like it would only cause problems, yes.
The cache is not running on the client, but rather on a different machine located on the same subnet as the client.
When the cache runs vanilla bind-9.11, things work (with the config file shown in the OP). When the (same) cache runs (also vanilla) bind-9.16, the cache seems to inject some as-yet-unclear delay into (some subset of) the queries generated by the client.
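For instance, I can eyeball the delay by hand with a one-off query through the cache (a sketch -- the qname label below is made up in iodine's style, and `delta-dns` is my cache's hostname):

```
# Time a single iodine-style NULL query via the cache; limit to one
# attempt with a generous timeout so the delay is visible:
dig @delta-dns raacdmaj0000.evl.edu NULL +tries=1 +time=10
```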
I'm trying to find out if anyone has run into this and perhaps can narrow down the individual change among the 9.11 -> 9.16 delta that's causing this behavior.
I'm essentially trying to avoid having to figure out how to bisect the whole bind 9.11 -> 9.16 change set (which is quite a non-trivial undertaking, to say the least) :)
Thanks for any additional clue anyone might be able to share!
OK. The workaround mentioned should still work, though - I don't have any hints about the bind setup. Without packet captures (from different viewpoints) comparing the two, it is hard to assist.
Thanks, I started out by trying to see if this is something that's already been at least seen in the wild, if not understood and solved :)
Since that's apparently not the case:
On Sat, Mar 26, 2022 at 08:19:35AM -0700, Erik Ekman wrote:
> I don't have any hints about the bind setup.
I guess this is my cue to come clean and say this is an experiment run inside a netns-container-based simulator (https://github.com/coreemu/core), and all three "elements" (iodine client, iodine server, and the caching bind server used to relay traffic between client and server) are deployed inside their own containers. So is the rest of the DNS infrastructure (root, top-level, etc. authorities), for that matter.
> Without packet captures (from different viewpoints) comparing the two it is hard to assist.
As such, I should be able to collect full packet (and DNS cache) logs in both scenarios (working, when the cache runs bind-9.11, and broken, when bind-9.16 is used instead), and post them somewhere for inspection. I wasn't expecting anyone would care enough to look at them, but if you would, that'd be awesome, and much appreciated!
I should have something available by tomorrow, hopefully...
So, with the cache `named.conf` shown in the OP, I collected packet traces on all three machines involved:

```
tshark -i eth0 -s 2048 -w /tmp/[server|cache|client].pcap
```

I then started `iodined` on the server (`iodined -P foo 192.168.14.1 evl.edu`), and then finally started the client (`iodine -r -P foo evl.edu`).
iodine_with_bind_9.11.zip shows the case where it works fine: the tunnel is established without any issues.
iodine_with_bind_9.16.zip shows the case where the tunnel fails to be brought up -- a log of the client's command-line output is also included.
The environment is absolutely identical in both cases, with the sole difference of the bind package version (9.11 when it works, 9.16 when it does not work).
The client log also has a timestamp taken right after it fails, and I noticed a bunch of DNS traffic being exchanged between the client and the cache, and also between the cache and the iodine server, for 5-10 seconds after the client gave up.
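To make those delays easy to see in the captures, something along these lines works (a sketch; the field selection is just one reasonable choice):

```
# Print relative time, query name, and query/response flag for all
# evl.edu traffic seen by the cache:
tshark -r cache.pcap -Y 'dns.qry.name contains "evl.edu"' \
    -T fields -e frame.time_relative -e dns.qry.name -e dns.flags.response
```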
Finally, here's a picture of the network map showing the topology of the `delta-user` client, `delta-dns` cache, and `ns.evl.edu` server:
Thanks again for taking a look, any further ideas much appreciated!
@yarrick -- I just completed a bisect on bind, and it seems the issue I'm seeing with the bind-9.16 cache is caused by commit https://github.com/isc-projects/bind9/commit/ae52c2117eba9fa0778125f4e10834d673ab811b
I don't yet understand how or why, but if I'm using a bind-9 cache to relay iodine traffic, any version that has that commit applied (at least along the 9.16 branch) will cause a failure to bring up the tunnel.
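For the record, the bisect was driven roughly like this (a sketch -- the tag names are placeholders, and the test script, which rebuilds bind, restarts the cache, and attempts an iodine handshake, is specific to my setup and not shown):

```
git clone https://github.com/isc-projects/bind9.git && cd bind9
git bisect start
git bisect bad  <9.16-tag-that-fails>    # tunnel fails to come up
git bisect good <9.11-tag-that-works>    # tunnel comes up fine
git bisect run ../iodine-tunnel-test.sh  # exits 0 on success, 1 on failure
```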
Long story short, the (new?) default in `named.conf` is `qname-minimization relaxed;`. Together with the commit I found earlier today via bisect, that setting leads to a bind-9.16 cache failing to allow iodine to bring up a tunnel.
Explicitly setting that option to either `qname-minimization strict;` or to `qname-minimization disabled;` will allow iodine to successfully use the cache for relaying/tunnelling once again. The bind-9 docs for `qname-minimization` suggest that they might switch to `strict` as the default option instead of `relaxed`, but I have no idea what time frame that might imply.
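In other words, one line in the cache's `options` block is enough to get things working again (a sketch; the rest of the config is unchanged):

```
options {
        // ... rest of the resolver config ...

        // The bind-9.16 default, "relaxed", breaks iodine here;
        // "strict" works as well:
        qname-minimization disabled;
};
```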
Either way, figured I'd update this issue so anyone else googling for the symptoms might find an explanation a bit less painfully than I had to ; )
I'm trying to bring up an iodine tunnel from a client through a vanilla-configured cache running on Fedora 36; the cache config file is the stock Fedora `named.conf`.
I'm starting the server using `iodined -P foo 192.168.14.1 evl.edu`, and the client using `iodine -r -P foo evl.edu`.
The symptom I'm observing is that the cache introduces a huge (around 5-second) delay into the forwarding of some of the client-generated queries (e.g. `raacdMAj...`, which I think are part of the downstream path fragment size negotiation). The client gives up on progressively smaller attempted downstream path fragment sizes, until it fails entirely. A few seconds later, the cache starts sending a bunch of the wrongfully-delayed client-generated queries to the server, which immediately replies to them, and the cache then logs them as `query failed (timed out)` in its `/var/named/data/named.run` log file.

This all used to work perfectly well back when Fedora used `named-9.11.*`; building the old `named-9.11` rpm packages for Fedora 36 and downgrading to them there gets it working perfectly well, so I suspect the problem is a bug (or a weird new configuration default) in `bind-9.16.*`.

I understand this does not appear to be a problem with iodine itself, but rather with `bind-9.16.*`, but I figured if anyone would know or care, or notice the problem, they'd be more likely to be found here than on some Fedora or bind-9 mailing list :) Any clue on what might be going on much appreciated -- thanks!