simpsora closed this 5 years ago
Hi Ross,
I was going to work on the deduplication myself, since it would improve diff efficiency and accuracy, and some people are using the output files with other tools. Thanks a lot for helping me cross this item off my list; very much appreciated.
cert_database.lookup() uses a list for storing subdomains, but the response from the service can contain (many) duplicate subdomains. In addition, the subdomain list is not in a deterministic order, so future diffs may be inaccurate.
This PR switches from a list to a set for storing the subdomains, which automatically deduplicates them. It also returns a sorted version, guaranteeing a consistent order for every call.
I tested this using the python.org domain, before and after applying the code in this PR, and there are a lot of duplicates for this domain. paypal.com was worse, at 7550 entries before and 2031 after.
This issue is very evident when using Slack, as the code makes individual requests to the Slack API for each subdomain; with a lot of subdomains this can take hours.
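For reference, the list-to-set change described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the function name dedupe_subdomains and the sample hostnames are hypothetical, and the real cert_database.lookup() may structure its parsing differently.

```python
def dedupe_subdomains(raw_subdomains):
    """Return the unique subdomains in a deterministic (sorted) order.

    `raw_subdomains` stands in for the names parsed from the service
    response; the name and shape of this helper are assumptions.
    """
    unique = set()  # a set silently drops repeated entries
    for name in raw_subdomains:
        unique.add(name)
    # sorted() returns a new list, so every call yields the same order
    return sorted(unique)


# Hypothetical response containing duplicates in arbitrary order.
response = [
    "www.example.com",
    "mail.example.com",
    "www.example.com",
    "docs.example.com",
]
print(dedupe_subdomains(response))
# ['docs.example.com', 'mail.example.com', 'www.example.com']
```

Because the result is both unique and sorted, two runs against the same data produce identical output, so a diff between runs only shows real changes, and any per-subdomain work (such as the Slack requests) happens once per name.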