Email Message Processing and Analysis

Research code for processing and analysing email and newsgroup messages.

The Webis-Gmane-19 email corpus was published at ACL 2020:

@InProceedings{stein:2020o,
  author =              {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
  booktitle =           {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  month =               jul,
  publisher =           {Association for Computational Linguistics},
  site =                {Seattle, USA},
  title =               {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
  year =                2020
}

The corpus itself can be found on Zenodo.

Quickstart

Install dependencies via:

pip3 install -r requirements.txt

The run.sh script can be used to start any of the tools and services from the src directory with the correct PYTHONPATH.
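
As a rough illustration of what such a wrapper does, the sketch below runs a Python script with the src directory prepended to PYTHONPATH. It is not the actual content of run.sh: the function name run_tool and the SRC_DIR variable are hypothetical, and the assumption that only src has to be on the path is a guess.

```shell
# Hypothetical stand-in for run.sh (assumption: only src needs to be on PYTHONPATH).
run_tool() {
    # Prepend the source directory, keeping any PYTHONPATH that is already set.
    PYTHONPATH="${SRC_DIR:-src}${PYTHONPATH:+:$PYTHONPATH}" python3 "$@"
}
```

When called from the repository root, run_tool src/parsing/message_segmenter.py --help should then behave much like the corresponding ./run.sh call. Prefer run.sh itself for real use.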

Train and Evaluate Model

Train model:

./run.sh src/parsing/message_segmenter.py train fasttext-model.bin \
    annotations/annotations-final-train.jsonl out/segmentation-model

Evaluate model:

./run.sh src/parsing/message_segmenter.py evaluate \
    trained-model.h5 fasttext-model.bin annotations/annotations-final-validation.jsonl

Pre-trained fastText and TensorFlow models can be found at files.webis.de.

Corpus Explorer

A web UI for data exploration can be found in src/explorer/explorer.py. Before starting it, copy the main config file src/conf/settings.py to src/conf/local_settings.py and adjust the config values (e.g. set the correct model paths).
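
The copy step can be scripted so that an existing local_settings.py is never overwritten. This is a minimal sketch: the helper name init_local_settings and its optional directory argument are hypothetical. Only the two file names come from the instructions above.

```shell
# Hypothetical helper: create local_settings.py from settings.py once.
init_local_settings() {
    conf_dir="${1:-src/conf}"
    # Leave an existing local configuration untouched.
    if [ ! -e "$conf_dir/local_settings.py" ]; then
        cp "$conf_dir/settings.py" "$conf_dir/local_settings.py"
    fi
}
```

Run init_local_settings from the repository root, then edit src/conf/local_settings.py by hand.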

The corpus explorer can be started using the run.sh script as follows:

./run.sh explorer [flask-options]

Note: the corpus explorer assumes that you have indexed the Webis-Gmane-19 corpus into Elasticsearch.

Other Tools in src

All command line tools in src can be started as follows:

./run.sh FILENAME

For individual usage instructions, run:

./run.sh FILENAME --help

The following tools are available:

All indexing scripts need a valid Elasticsearch configuration. See the Corpus Explorer section for details.