yarrick / iodine

Official git repo for iodine dns tunnel
https://code.kryo.se/iodine
ISC License

can't bring up tunnel through Fedora (36) bind-9.16.27 #72

Closed: gsomlo closed this issue 2 years ago

gsomlo commented 2 years ago

I'm trying to bring up an iodine tunnel from a client through a vanilla-configured bind caching resolver ("the cache" below) running on Fedora 36. The cache's config file is:

# /etc/named/named.conf
options {
    listen-on port 53 { any; };
    allow-query { any; };
    recursion yes;
    dnssec-validation no;
    querylog yes;
    directory "/var/named";
    dump-file "/var/named/data/cache_dump.db";
    statistics-file "/var/named/data/named_stats.txt";
    memstatistics-file "/var/named/data/named_mem_stats.txt";
    pid-file "/run/named/named.pid";
    session-keyfile "/run/named/session.key";
};

logging {
    channel default_debug {
        file "data/named.run";
        severity dynamic;
    };
};

zone "." IN {
    type hint;
    file "named.ca";
};

I'm starting the server using

iodined -P foobar 192.168.14.1 -DD -f test.com

and the client using

iodine -r -P foobar test.com

The symptom I'm observing is that the cache introduces a huge (around 5-second) delay when forwarding some of the client-generated queries (e.g. raacdMAj..., which I think are part of the downstream path fragment size negotiation).

What I'm seeing is that the client gives up on progressively smaller attempted downstream path fragment sizes until it fails entirely. A few seconds later, the cache starts sending a bunch of the wrongly delayed client-generated queries to the server, which replies to them immediately; the cache then logs them as "query failed (timed out)" in its /var/named/data/named.run log file.
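For reference, one way to sanity-check where the delay is being added is to time a one-off query through the cache versus directly against iodined (the addresses and the label below are placeholders, not the real negotiation traffic):

# query through the cache
time dig @10.0.0.2 raacdMAj0000.test.com NULL +tries=1 +time=10

# query iodined directly
time dig @10.0.0.3 raacdMAj0000.test.com NULL +tries=1 +time=10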

This all used to work perfectly well back when Fedora shipped named-9.11.*; rebuilding the old named-9.11 rpm packages for Fedora 36 and downgrading to them gets everything working perfectly again, so I suspect the problem is a bug (or a weird new configuration default) in bind-9.16.*.

I understand this does not appear to be a problem with iodine itself, but rather with bind-9.16.*; I just figured that anyone who would know about, care about, or have noticed this problem is more likely to be found here than on some Fedora or bind-9 mailing list :)

Any clue on what might be going on much appreciated -- thanks!

yarrick commented 2 years ago

Running iodine through a caching resolver on the client seems like it would only cause problems, yes. The simplest workaround is to specify a DNS server for the client to use (instead of falling back to the system default).
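For example, assuming the iodined host is directly reachable at 10.0.0.3 (placeholder address), the client can be pointed at it explicitly rather than at the system resolver:

iodine -r -P foobar 10.0.0.3 test.com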

gsomlo commented 2 years ago

Thanks for the reply! However, to clarify:

On Sat, Mar 26, 2022 at 07:45:24AM -0700, Erik Ekman wrote:

Running iodine through a caching resolver on the client seems like it would only cause problems, yes.

The cache is not running on the client, but rather on a different machine located on the same subnet as the client.

When the cache runs vanilla bind-9.11, things work (with the config file shown in the OP). When the (same) cache runs (also vanilla) bind-9.16, it seems to inject some as-yet-unexplained delay into (some subset of) the queries generated by the client.

I'm trying to find out if anyone has run into this and perhaps can narrow down the individual change among the 9.11 -> 9.16 delta that's causing this behavior.

I'm essentially trying to avoid having to figure out how to bisect the whole bind 9.11 -> 9.16 change set (which is quite a non-trivial undertaking, to say the least) :)

Thanks for any additional clue anyone might be able to share!

yarrick commented 2 years ago

OK. The workaround mentioned should still work though - I don't have any hints about the bind setup. Without packet captures (from different viewpoints) comparing the two, it is hard to assist.

gsomlo commented 2 years ago

Thanks, I started out by trying to see if this is something that's already been at least seen in the wild, if not understood and solved :)

Since that's apparently not the case:

On Sat, Mar 26, 2022 at 08:19:35AM -0700, Erik Ekman wrote:

I don't have any hints about the bind setup.

I guess this is my cue to come clean and say this is an experiment run inside a netns-container-based simulator (https://github.com/coreemu/core), and all three "elements" (iodine client, iodine server, and the caching bind server used to relay traffic between client and server) are deployed inside their own containers. So is the rest of the DNS infrastructure (root, top-level, etc. authorities), for that matter.

Without packet captures (from different viewpoints) comparing the two it is hard to assist.

As such, I should be able to collect full packet captures (and DNS cache logs) in both scenarios (working, with the cache running bind-9.11, and broken, with bind-9.16), and post them somewhere for inspection. I wasn't expecting anyone would care enough to look at them, but if you would, that'd be awesome and much appreciated!
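Something along these lines on each of the three nodes (client, cache, iodined server) should be enough; the interface name is a guess for this setup:

tcpdump -i eth0 -w "$(hostname)-dns.pcap" port 53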

I should have something available by tomorrow, hopefully...

gsomlo commented 2 years ago

So, with the cache named.conf shown in the OP, I collected packet traces on all three machines involved.

I then started iodined on the server: iodined -P foo 192.168.14.1 evl.edu, and then finally started the client: iodine -r -P foo evl.edu.

iodine_with_bind_9.11.zip shows the case where it works fine: the tunnel is established without any issues.

iodine_with_bind_9.16.zip shows the case where the tunnel fails to come up -- a log of the client's command-line output is also included.

The environment is absolutely identical in both cases, with the sole difference of the bind package version (9.11 when it works, 9.16 when it does not work).

The client log also includes a timestamp taken right after the failure, and I noticed a bunch of DNS traffic still being exchanged between the client and the cache, and between the cache and the iodine server, for 5-10 seconds after the client gave up.

Finally, here's a picture of the network map showing the topology of the delta-user client, delta-dns cache, and ns.evl.edu server: iodine_map

Thanks again for taking a look, any further ideas much appreciated!

gsomlo commented 2 years ago

@yarrick -- I just completed a bisect on bind, and it seems the issue I'm seeing with the bind-9.16 cache is caused by commit https://github.com/isc-projects/bind9/commit/ae52c2117eba9fa0778125f4e10834d673ab811b

I don't yet understand how or why, but if I'm using a bind-9 cache to relay iodine traffic, any version that has that commit applied (at least along the 9.16 branch) will cause a failure to bring up the tunnel.
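For anyone wanting to reproduce the bisect, the rough procedure is the standard one against the upstream bind9 repo; the endpoint tags below are placeholders for whichever 9.11 and 9.16 releases are being compared, and each step means rebuilding named and installing it on the cache:

git clone https://gitlab.isc.org/isc-projects/bind9.git
cd bind9
git bisect start <first-bad-9.16-tag> <last-good-9.11-tag>
# rebuild named, install it on the cache, retry bringing up the tunnel, then run one of:
git bisect good    # if the tunnel comes up
git bisect bad     # if it fails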

gsomlo commented 2 years ago

Long story short, the (new?) default in named.conf is qname-minimization relaxed. Together with the commit I found earlier today via bisect, that setting causes a bind-9.16 cache to prevent iodine from bringing up a tunnel.

Explicitly setting that option to either qname-minimization strict; or to qname-minimization disabled; will allow iodine to successfully use the cache for relaying/tunnelling once again. The bind-9 docs for qname-minimization suggest that they might switch to strict as the default option instead of relaxed, but I have no idea what time frame that might imply.
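Concretely, either of these settings in the options block of the named.conf from the OP gets the tunnel working again:

options {
    # ... existing options from the OP config ...
    qname-minimization disabled;    # or: qname-minimization strict;
};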

Either way, I figured I'd update this issue so anyone else googling for the symptoms might find an explanation a bit less painfully than I had to ;)