zkemail / zk-email-verify

Verify any text in any sent or received email, cryptographically and via only trusting the sending mailserver.
https://prove.email
MIT License

Build DKIM archive website #81

Closed Divide-By-0 closed 1 month ago

Divide-By-0 commented 1 year ago

Edit: This is now WIP at https://github.com/foolo/dkim-lookup!

DKIM is usually a nested DNS record. For instance, for replit, we can see here: https://easydmarc.com/tools/dkim-lookup?domain=replit.com that the DKIM is under the selector "google" and has the value:

Selector: google
Record value:
v=DKIM1; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAk6RNxaxuNyiPhlH6rlgMOXNTaffcVsK+3E6lK1x8c7MO0w7on9zmaiApGE/2hBWQqRpy6EmRdUf6MJH5TmwM++51W4xR0TmTd1JvsbBR/9yjpR++vOahVkrdh0xPaq1zghHYaqNgsOThivw8Hgd8xWQzPPDcw7T+czQS0/Xe/nijU0dVlQX/s+evJpxP7VV/FzlMQvknMj1bCqAgzUFa1mXMO/ZfzHirpGVcJ+h1fMYOIzU4iV3KUIn6i1mg3T+Kw41MFW04F/4nnIQKTTFNGuI+T+6Ss1M1VcjlAxlwYZCJPE0Iy3cOWRBWsgXFZWx2rATlEtkasmf1NFpJu1nATwIDAQAB
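As a quick sanity check when archiving, the `p=` tag can be base64-decoded into the DER-encoded RSA public key. A minimal sketch (the exact validation an archiver wants is an open choice; this only checks the DER framing):

```python
import base64

def dkim_key_der(p_tag: str) -> bytes:
    """Decode the base64 'p=' tag of a DKIM record into DER bytes.

    A 2048-bit RSA SubjectPublicKeyInfo is ~294 bytes and begins with a
    DER SEQUENCE tag (0x30), which makes for a cheap sanity check before
    archiving a scraped record.
    """
    der = base64.b64decode(p_tag)
    if not der or der[0] != 0x30:
        raise ValueError("p= tag is not a DER-encoded public key")
    return der
```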

Scrape the Alexa top 1M websites (and a list of 50 websites that we manually add) for their DKIM keys every day, and archive all the answers in a simple UI where someone can just type in a website and see all its past DKIM keys. Note that these DNS records change roughly daily, and we want all selectors, including non-Google ones. Looking for a simple frontend, as well as a script that can be run daily without being rate-limited. I recommend hitting DNS directly.

One way to do this is in Python: use something like pydig to query the data, store it in a PostgreSQL database, and provide a FastAPI webserver for browsing it. Approximately 400 non-compressible bytes per entry, times 1M sites changing daily, would be at most 400 MB of data per day (thanks npulido for the suggestion).
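A minimal sketch of the storage side, using SQLite and a hand-rolled tag parser for brevity (in production you would swap in PostgreSQL, and fetch the record with pydig or dnspython's `dns.resolver.resolve(f"{selector}._domainkey.{domain}", "TXT")`):

```python
import sqlite3
from datetime import datetime, timezone

def parse_dkim_record(txt: str) -> dict:
    """Split a DKIM TXT record like 'v=DKIM1; k=rsa; p=...' into tag/value pairs."""
    tags = {}
    for part in txt.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            tags[key.strip()] = value.strip()
    return tags

def store_key(conn: sqlite3.Connection, domain: str, selector: str, record: str) -> None:
    """Archive a scraped record; identical (domain, selector, record) rows are
    skipped, so daily re-scrapes only add rows when the key actually rotates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dkim_keys (
               domain TEXT, selector TEXT, record TEXT, first_seen TEXT,
               UNIQUE (domain, selector, record))"""
    )
    conn.execute(
        "INSERT OR IGNORE INTO dkim_keys VALUES (?, ?, ?, ?)",
        (domain, selector, record, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

The `UNIQUE` constraint plus `INSERT OR IGNORE` is what turns a daily scrape into a compact history: a row only appears when the key for that (domain, selector) changes.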

Eventually, include dynamic checking (i.e. for each site, store the gap between the last n checks, and check more often around the distribution of those times).
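The dynamic-checking idea could start as simply as this sketch: `change_gaps_days` would come from the archive's timestamps, and the floor and backoff factor are arbitrary illustrative values.

```python
from statistics import median

def next_check_delay(change_gaps_days, floor_days=1.0, factor=0.25):
    """Poll daily by default; once a domain shows a stable rotation rhythm,
    back off to a fraction of its typical gap between key changes, so
    polls cluster around the times rotations are actually expected."""
    if not change_gaps_days:
        return floor_days
    return max(floor_days, factor * median(change_gaps_days))
```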

Divide-By-0 commented 7 months ago

yes! So we created this local-only app: https://github.com/zkemail/selector-scraper. It scrapes the selectors from the last 10,000 emails in your personal inbox, then displays them in a very simple, very ugly list on a frontend. Turns out, like most things, you can cover a good 20% of all websites with only about 40 selectors, and all the rest are one-offs. So if we just slightly modified this script to add these selectors to a database with the date, and then read from that database, we could have a historical registry.

So then we just need a very simple, pretty client-side website (one existing example is https://easydmarc.com/tools/dkim-lookup, but we can do better) that also offers historical results and any matched selectors in the db for any searched domain (which we can get since all your emails are timestamped).

i ran this on one of my non-primary inboxes and got this list: selector_db_dump.txt

Divide-By-0 commented 7 months ago

Olof: instead of a database with dkim key(s) for each domain, we make a database with selectors for each domain, and then a website which fetches the selectors for a specific domain from the DB, and then gets the dkim keys on-the-fly with a dns lookup (that happens in the client's browser) to the domain of interest?

Well the database should store historical dkim keys, plus maybe a signature from the user uploading them -- and yes in real time, we can also get the latest one from local client DNS (as well as locally calculate the poseidon hash to compare to the onchain one). Unfortunately rn there isn't a great way to verify them except by trusting certain signatories for now.

foolo commented 7 months ago

> Well the database should store historical dkim keys, plus maybe a signature from the user uploading them -- and yes in real time, we can also get the latest one from local client DNS (as well as locally calculate the poseidon hash to compare to the onchain one). Unfortunately rn there isn't a great way to verify them except by trusting certain signatories for now.

@Divide-By-0 In https://github.com/zkemail/selector-scraper we store the selectors into a sqlite db. Which of the following do we want?

  1. Modify selector-scraper so that it stores selectors and fetches+stores DKIM keys, and also modify it to use PostgreSQL instead of SQLite.
  2. Create a new app that goes through all the selectors from the SQLite db, fetches the DKIM keys online, and puts them in a PostgreSQL db (together with info about selectors, dates, etc.)?
Divide-By-0 commented 7 months ago

Well this sqlite one was the quickest to put up, but yeah I'd recommend moving to postgresql generally.

If you keep the current app, we'd have to find a way to adapt the public scraper code so it only has permission to add records (with signatures), not direct db access, and add some basic DDoS protection. I would say you should do whatever is easiest for you; I'm fine keeping it as the same site or as two separate sites.

foolo commented 7 months ago

> Well this sqlite one was the quickest to put up, but yeah I'd recommend moving to postgresql generally.
>
> If you keep the current app, we'd have to find a way to make the public scraper code to adapt to only have access perms to add records *with signatures), not direct db access, and have some basic ddos protection. I would say you should do whatever is easiest for you, I'm fine keeping it as the same site or as two seperate sites.

@Divide-By-0 Ok, thanks! I'm creating a Next.js app with a Vercel Postgres database. I also created an uploader script (yes, another script :) ) that reads domains+selectors from the emails.db sqlite3 file, then fetches DKIM records from the DNS server, and uploads everything to the Postgres server on Vercel, where the data can then be used by the end-user website. Right now this uploader script connects via a direct database connection, but later we can change it to use an API route instead. We can also change the uploader script so that it reads domains+selectors from some common file format, and then write data scrapers for email providers other than Gmail, as long as their output uses that common format.
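The common format could be as simple as one tab-separated `domain<TAB>selector` pair per line; a sketch of the reader side, under that (hypothetical) format, so any provider-specific scraper emitting it can feed the same uploader:

```python
import csv

def read_domain_selectors(path: str) -> list[tuple[str, str]]:
    """Read 'domain<TAB>selector' pairs from a TSV file, skipping blank or
    malformed lines, so scrapers for any provider can share one uploader."""
    pairs = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2 and row[0].strip() and row[1].strip():
                pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```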

foolo commented 6 months ago

@Divide-By-0 I worked a bit more on this website. It's live on https://dkim-lookup.vercel.app/ and the code is here: https://github.com/foolo/dkim-lookup/tree/main/dkim-lookup-app

Current features are briefly:

Question:

Regarding "Scrape the alexa top 1M websites": we discussed this a while ago and I think we chose the email-inbox-scraping approach instead (?), for the reason that there is no direct way of knowing the selector names for a particular domain. Do we still want this feature in some form or another? For example, we could loop over the 1M list and guess among the 25 most common selectors. Then there is also the problem that the user-facing domain is not always the same as the DNS domain for the DKIM key lookup. E.g. example.com may use examplemail.com for DKIM verification, so we won't necessarily find anything if we search for selectors directly on the domains from the Alexa list.
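Guessing could look like the sketch below; the selector shortlist is illustrative (drawn from ones mentioned in this thread plus common defaults), not an actual top-25, which would come from the selector-scraper dump:

```python
# Illustrative shortlist; a real one would come from the selector-scraper dump.
COMMON_SELECTORS = ["google", "default", "selector1", "selector2",
                    "k1", "s1", "s2", "dkim", "mail", "fz2048"]

def candidate_dkim_names(domain: str, selectors=COMMON_SELECTORS) -> list[str]:
    """DNS names to probe when guessing selectors for a domain off the 1M list;
    each would be queried as a TXT record and archived on a hit."""
    return [f"{sel}._domainkey.{domain}" for sel in selectors]
```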

foolo commented 6 months ago

@Divide-By-0 Another example: on the 1M list we would find yahoo.com, but when we scrape emails, the from-address and the DKIM domain are cc.yahoo-inc.com:

From: Yahoo <no-reply@cc.yahoo-inc.com>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cc.yahoo-inc.com; s=fz2048;
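A minimal sketch of pulling both tags out of the header, so a scraper archives the signing domain (`d=`) and selector (`s=`) rather than the user-facing From: domain:

```python
import re

def signing_domain_and_selector(dkim_signature: str):
    """Extract the d= (signing domain) and s= (selector) tags from a
    DKIM-Signature header value; the signing domain can differ from the
    From: domain, as in the yahoo.com / cc.yahoo-inc.com example."""
    d = re.search(r"(?:^|;)\s*d=([^;\s]+)", dkim_signature)
    s = re.search(r"(?:^|;)\s*s=([^;\s]+)", dkim_signature)
    return (d.group(1) if d else None, s.group(1) if s else None)
```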
Divide-By-0 commented 1 month ago

Built at https://github.com/zkemail/archive.prove.email -- relevant discussion moved to issues there.