paul-pearce opened 7 years ago
Why should this be different than the normal number of retries?
This isn't an issue with the number of retries. It's an issue that we do not rotate through the roots.

E.g., for iterative, we first randomly select a `.` root. If that fails, the entire process fails. Conversely, if our `.` query succeeds and we receive the `.com` authoritatives, and the first `.com` authoritative fails, we will continue to retry different `.com` authoritatives until timeout.

`--retries` has no impact on this, as it will try the same `.` root over and over.
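A minimal sketch of the behavior being asked for, assuming hypothetical helper names (`queryRoot`, `resolveWithRotation` — nothing here is zdns's actual API): instead of retrying the same randomly chosen root, each retry draws a different root from a random permutation.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// rootServers is an illustrative subset of the 13 root server addresses.
var rootServers = []string{
	"198.41.0.4",   // a.root-servers.net
	"199.9.14.201", // b.root-servers.net
	"192.33.4.12",  // c.root-servers.net
}

// queryRoot is a stand-in for a real query; here it pretends one root is down.
func queryRoot(addr string) error {
	if addr == "198.41.0.4" {
		return errors.New("timeout")
	}
	return nil
}

// resolveWithRotation retries against a *different* randomly chosen root each
// attempt, rather than hammering the same one as --retries does today.
func resolveWithRotation(retries int) error {
	perm := rand.Perm(len(rootServers))
	for i := 0; i <= retries && i < len(perm); i++ {
		addr := rootServers[perm[i]]
		if err := queryRoot(addr); err != nil {
			fmt.Printf("root %s failed: %v\n", addr, err)
			continue
		}
		fmt.Printf("root %s answered\n", addr)
		return nil
	}
	return errors.New("all attempted roots failed")
}

func main() {
	if err := resolveWithRotation(2); err != nil {
		fmt.Println(err)
	}
}
```

With rotation, a single dead root only costs one attempt; today the same dead root consumes the entire retry budget.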
Out of curiosity, why do some root resolvers stop responding?
Great question. I don't know, but I observed it. I encountered this during one of my test runs when working on the `recursion` branch. One of the runs had a failure rate about 7% higher than expected. Upon investigation, I discovered that one of the roots was timing out. I manually poked it and it was, indeed, not responding to that measurement machine. It may have been a rate-limiting reaction, but I doubt it. The failures were immediate during that run, and I've not observed it before or since.
It looks like the current code has a similar behavior to the old code in this regard. `--retries` simply retries connecting to the same nameserver, I suppose assuming there was a transitory network issue in reaching that nameserver.
@zakird and @paul-pearce, do you think we should make this change for all levels, not just the root nameservers? Like if `a.gtld-servers.net` fails and we have retries left, should we choose another `.com` nameserver at random?
Additionally, I can imagine `--retries` being:

- per-NS connection (as it is now)
- per layer (`--retries=3` means we can attempt to connect to 3 `.com` NS's before giving up)
- per domain (`--retries=3` means we can re-attempt 3 times during a domain's entire iterative lookup)

I don't have strong feelings, but I think per domain is the most easily understandable as a user. LMK your thoughts.
Yeah, I definitely think trying others at every layer is the right call. I think retries could be a max number total for a given thing that you are trying to look up. Seems easiest to understand and consistent?
Yeah I agree, definitely easiest for the user to understand!
Right now, if a root server times out in `--iterative` mode, the query fails without trying other roots. This is because the root servers were bolted onto `factory.RandomNameServer`. This behavior should change, but it will require a fairly large restructuring of how we handle name servers. However we fix it, we should also have `--retries > 1` try other nameservers (if they exist).