zmap / zdns

Fast DNS Lookup Library and CLI Tool
Apache License 2.0
939 stars 123 forks source link

DoH/DoT/TCP-based lookups and connection re-use #439

Closed phillip-stephens closed 1 month ago

phillip-stephens commented 1 month ago

Adds "name server stickiness" to lookups so Resolvers will prioritize processing queries from the same nameserver to avoid having to re-handshake TCP/TLS/HTTPS.

Changes

Overview

In #431 (which this branch is based on, wanted to have this logic reviewed before merging into that to break up this larger feature), I added functionality with DoH and DoT connections that a given resolver would only re-handshake if the nameserver was different than the remote address on it's existing connection.

The issue is that in ExternalLookup here, if the user didn't provide a nameserver then a random one would be selected. Let's look at an example to see how this causes us to unnecessarily tear-down connections:

./zdns google.com yahoo.com eBay.com --threads=2 --name-servers=1.1.1.1,8.8.8.8 --tls And let's say that after the random NS selection, this gives us

In this toy example, thread #0 would tear down it's TLS connection and re-establish one to 8.8.8.8 even though thread #1 already has a connection to 8.8.8.8.

Load Imbalance

As another design consideration, all external resolvers do not behave equally.

I measured the resolution time for 7k queries and to what resolver they were sent to.

        IP        mean          max         std  count
0  1.0.0.1   94.638474  3996.846985  243.672404   1752
1  1.1.1.1  112.375685  5456.484652  313.006790   1781
2  8.8.4.4   35.020819  2615.063632  101.810439   1732
3  8.8.8.8   31.811647  1253.615863   70.430397   1711

This led to the threads responsible for Google queries sitting idle while the Cloudflare ones were busy working.

Design

To address both re-using TCP connections and dealing with load imbalance, this PR implements a Priority and Global work queue.

A new inputDeMultiplexor chooses an external NS for each input line and passes it to the assigned priority queue. Each priority queue is for queries to a specific name-server, ex: @1.1.1.1. If the priority queue is full, then the demultiplexer will block on both the Priority/Global queue. In this way, we prioritize sending work to the threads which have a pre-existing connection to the name server, but also avoid large work imbalance issues.

Similarly, threads will prefer to read from their Priority queue, before blocking on both Priority and Global queues.

Tradeoff: Work Imbalance vs. Connection Re-use

Every time a worker chooses a work item from the Global queue, it will have to re-handshake. However, without this Global/Priority queue split, workers with Priority on a very fast nameserver will sit idle when they could be doing work too. To showcase this, see the below experiment.

I ran an experiment to check different points on this spectrum:

As the blocking time increases, the odds that a non-Priority thread will need to take an item from the Global queue to load-balance decreases. This decreases new TCP handshakes but increases the runtime.

image

3x runs per condition 7,000 domains run with "A", "--verbosity=3", "--threads=100", "--tcp-only", "--name-servers=1.1.1.1,1.0.0.1,8.8.8.8,8.8.4.4",

Unaffected

--iterative lookups will ignore this suggestion and chose a random root server. Since we don't support --tls or --https with --iterative anyway, this isn't a concern.

Performance

Using the benchmark (7k domains), edited the command to use ./zdns A --name-servers=1.0.0.1,1.1.1.1,8.8.8.8,8.8.4.4 --threads=100 (external lookups)

Testing

phillip-stephens commented 1 month ago

After talking with Zakir, we can make this quite a bit simpler if we just re-use the existing connection in ExternalLookup if the user/CLI doesn't suggest a new one. Closing this but leaving branch in case we ever want to re-visit. #445 has the new approach