wsjdata / clinton-email-cruncher

Download Hillary Clinton's emails and query them with sqlite
MIT License

Feature request: integrate with the email processing tools from Mining the Social Web, Second Edition #1

Open znmeb opened 8 years ago

znmeb commented 8 years ago

I'm not a Pythonista so I don't know how difficult this is, but I'd like to see this integrated with the email processing tools from Mining the Social Web, Second Edition (https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition).

If I get some spare time (not likely given the three other projects I'm hacking on) I might try to integrate the SQLite database with the R "tm" corpus tools.

martinburch commented 8 years ago

So, specifically, Chapter 6: Mining Mailboxes? Since this corpus text is produced by OCR and humans, it won't be as easy to clean and parse as the Enron emails, which are properly structured email messages with headers.

znmeb commented 8 years ago

@martinburch Just converting it to either of the two "standard" formats would suffice for me. One format is a directory with files and the other is one big file. They're convertible to each other in R and in Python.
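
For reference, a minimal sketch of that conversion in Python, assuming the two formats meant here are Maildir (a directory with one file per message) and mbox (one big file). It uses only the standard-library `mailbox` module; the function names are made up for illustration and are not part of this repo.

```python
import mailbox


def copy_messages(source, target):
    """Append every message in `source` to `target`; return how many were copied."""
    copied = 0
    target.lock()  # advisory lock for mbox; a no-op for Maildir
    try:
        for message in source:
            # add() converts between MaildirMessage and mboxMessage as needed
            target.add(message)
            copied += 1
        target.flush()
    finally:
        target.unlock()
    return copied


def maildir_to_mbox(maildir_path, mbox_path):
    """Directory of files -> one big file."""
    source = mailbox.Maildir(maildir_path, create=False)
    target = mailbox.mbox(mbox_path)
    try:
        return copy_messages(source, target)
    finally:
        source.close()
        target.close()


def mbox_to_maildir(mbox_path, maildir_path):
    """One big file -> directory of files."""
    source = mailbox.mbox(mbox_path, create=False)
    target = mailbox.Maildir(maildir_path, create=True)
    try:
        return copy_messages(source, target)
    finally:
        source.close()
        target.close()
```

So whichever of the two formats an export produces, the other should be one function call away.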

martinburch commented 8 years ago

My immediate goal for this repo is to add more metadata to the SQLite database. Extracting that database to a standard mailbox format is not my priority, but I welcome a pull request that provides output formats.
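
As a starting point for such a pull request, here is a rough sketch of an mbox export using Python's standard `sqlite3` and `mailbox` modules. The table and column names (`emails`, `sender`, `recipient`, `subject`, `sent_date`, `body`) are hypothetical placeholders, not this repo's actual schema, and would need to be swapped for the real ones. Header values are written as plain text rather than validated, since the OCR'd fields are unlikely to be well-formed addresses or dates.

```python
import mailbox
import sqlite3
from email.message import Message

# HYPOTHETICAL schema: adjust the table and column names to the real database.
DEFAULT_QUERY = "SELECT sender, recipient, subject, sent_date, body FROM emails"


def _clean(value):
    """Collapse whitespace so OCR line breaks cannot split a header."""
    return " ".join(str(value).split())


def export_to_mbox(db_path, mbox_path, query=DEFAULT_QUERY):
    """Write one mbox message per row returned by `query`; return the count."""
    conn = sqlite3.connect(db_path)
    box = mailbox.mbox(mbox_path)
    exported = 0
    box.lock()
    try:
        for sender, recipient, subject, sent_date, body in conn.execute(query):
            msg = Message()
            headers = (("From", sender), ("To", recipient),
                       ("Subject", subject), ("Date", sent_date))
            for name, value in headers:
                if value:  # skip NULL or empty fields
                    msg[name] = _clean(value)
            # Body is stored as UTF-8 text (base64-encoded by the email package)
            msg.set_payload(body or "", charset="utf-8")
            box.add(msg)
            exported += 1
        box.flush()
    finally:
        box.unlock()
        box.close()
        conn.close()
    return exported
```

Calling `export_to_mbox("emails.sqlite", "emails.mbox")` (both file names are examples) would then give a single file that mbox-aware tools can read, with the caveat that dates and addresses are only as clean as the source text.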