wsjdata / clinton-email-cruncher

Download Hillary Clinton's emails and query them with sqlite
MIT License
153 stars 30 forks source link

Get and analyze Hillary Clinton's email

In response to a public records request, the U.S. State Department is releasing Hillary Clinton's email messages from her time as secretary of state. Every month, newly released messages are posted to foia.state.gov as PDFs, with some metadata.

This collection of tools automates downloading and helps analyze the messages. The Wall Steet Journal's interactive graphics team uses some of this code to power our Clinton inbox search interactive.

We welcome your pull requests and issue reports.

What's in the toolkit

How to get started

Clone the repo.

git clone https://github.com/wsjdata/clinton-email-cruncher.git
cd clinton-email-cruncher

Install virtualenv if necessary.

pip install virtualenv

Create a virtual environment. Python 2.7.9 is required, specifically for SSL (HTTPS) support. State Department's website requires HTTPS.

virtualenv -p /usr/bin/python2.7 virt-hrcemail
source virt-hrcemail/bin/activate

Install all the Python dependencies.

pip install -r requirements.txt

Then, run the shell script.

./run.sh

You will need wget to download the PDFs. (Mac OS X users can install it using homebrew.) Downloading the PDFs can take around 30 minutes. If you don't want to download the PDFs, run ./run.sh no-pdf-download.

Finally, load HRCEMAIL_names.csv into the hrcemail.sqlite database.

csvsql --db "sqlite:///hrcemail.sqlite" --insert --no-create --blanks --table name  HRCEMAIL_names.csv 

Let's do some analysis!

How many messages did everyone send and receive? Run this SQL query:

sql2csv --db "sqlite:///hrcemail.sqlite" --query 'SELECT commonName,sum(to_count) to_sum, sum(from_count) from_sum, sum(from_count+to_count) total_sum FROM (
SELECT toName.`commonName`,0 from_count, count(distinct docID) to_count
FROM document d
JOIN name toName ON toName.`originalName` = d.`to`
JOIN name fromName ON fromName.`originalName` = d.`from`
group by toName.`commonName`
UNION ALL
SELECT fromName.`commonName`,count(distinct docID) from_count, 0 to_count
FROM document d
JOIN name toName ON toName.`originalName` = d.`to`
JOIN name fromName ON fromName.`originalName` = d.`from`
group by fromName.`commonName`
) t GROUP BY commonName
ORDER BY total_sum DESC;' | head | csvlook

How you can help

Are there any names in the document table that are not resolved in the name table? Use this query to check:

SELECT d.originalName d,n.originalName n FROM (SELECT distinct `to` originalName
FROM document
UNION
SELECT distinct `from` originalName
FROM document) d
LEFT JOIN name n ON TRIM(d.originalName) LIKE n.originalName
WHERE n.originalName IS NULL;

Find anything that needs to be updated? Fix the name table, export to HRCEMAIL_names.csv, and make a pull request.

sqlite3 -header -csv hrcemail.sqlite "SELECT * FROM name ORDER BY commonName,originalName;" > HRCEMAIL_names.csv 

Future work