Open znmeb opened 8 years ago
So, specifically, Chapter 6: Mining Mailboxes? Since the text in this corpus was produced by OCR and by humans, it won't be as easy to clean and parse as the Enron emails, which are properly structured email messages with headers.
@martinburch Just converting it to either of the two "standard" formats would suffice for me. One format is a directory with files and the other is one big file. They're convertible to each other in R and in Python.
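For reference, here is a minimal sketch of that conversion in Python. The comment above does not name the two formats, so this assumes they are mbox (one big file) and Maildir (a directory with one file per message). It uses only the standard-library `mailbox` module; the function names `mbox_to_maildir` and `maildir_to_mbox` are made up for illustration.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir directory.

    Assumption: "one big file" means mbox and "a directory with files"
    means Maildir. Returns the number of messages copied.
    """
    src = mailbox.mbox(str(mbox_path), create=False)
    dst = mailbox.Maildir(str(maildir_path), create=True)
    count = 0
    try:
        for message in src:
            dst.add(message)  # mailbox converts the message type on add
            count += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return count


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir directory into an mbox file.

    If the mbox file already exists, messages are appended to it.
    Returns the number of messages copied.
    """
    src = mailbox.Maildir(str(maildir_path), create=False)
    dst = mailbox.mbox(str(mbox_path), create=True)
    count = 0
    try:
        for message in src:
            dst.add(message)
            count += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return count
```

Something like `mbox_to_maildir("corpus.mbox", "corpus_maildir")` would then produce the directory layout from the single file (both paths are placeholders).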
My immediate goal for this repo is to add more metadata to the SQLite database. Extracting that database to a standard mailbox format is not my priority, but I welcome a pull request that provides output formats.
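As a starting point for such a pull request, here is a rough sketch of an export to mbox. The database schema is not described in this thread, so the table name `emails` and the columns `sender`, `recipient`, `subject`, `sent_date`, and `body` are purely hypothetical and would need to be replaced with the real ones. Only the standard-library `sqlite3`, `mailbox`, and `email` modules are used.

```python
import mailbox
import sqlite3
from email.message import Message


def _clean(value):
    """Collapse whitespace so OCR line breaks cannot end up in a header."""
    return " ".join(str(value).split())


def export_to_mbox(db_path, mbox_path):
    """Write one mbox message per database row; return the message count.

    HYPOTHETICAL schema: a table `emails` with columns sender, recipient,
    subject, sent_date, body. Adjust the query to the real database.
    """
    conn = sqlite3.connect(db_path)
    # If the mbox file already exists, new messages are appended to it.
    box = mailbox.mbox(mbox_path, create=True)
    count = 0
    try:
        rows = conn.execute(
            "SELECT sender, recipient, subject, sent_date, body FROM emails"
        )
        for sender, recipient, subject, sent_date, body in rows:
            # The legacy Message class stores header strings as given,
            # which tolerates messy, non-standard values.
            msg = Message()
            for name, value in (
                ("From", sender),
                ("To", recipient),
                ("Subject", subject),
                ("Date", sent_date),
            ):
                if value:  # skip NULL or empty fields
                    msg[name] = _clean(value)
            msg.set_payload(body or "", charset="utf-8")
            box.add(msg)
            count += 1
        box.flush()
    finally:
        box.close()
        conn.close()
    return count
```

The resulting file could then be opened with `mailbox.mbox(...)` and fed to other email tooling. Because the source text comes from OCR, header values such as dates may not be valid, so downstream tools should not assume they parse cleanly.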
I'm not a Pythonista so I don't know how difficult this is, but I'd like to see this integrated with the email processing tools from Mining the Social Web, Second Edition (https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition).
If I get some spare time (not likely given the three other projects I'm hacking on) I might try to integrate the SQLite database with the R "tm" corpus tools.