zachguo / TCoHOT

Temporal Classification of HathiTrust OCRed Texts (code for the paper published at iConference 2015)
http://hdl.handle.net/2142/73656

Temporal Classification of HathiTrust OCRed Texts

Paper published in the iConference 2015 Proceedings: http://hdl.handle.net/2142/73656

This is also a course project for Z604 (Big Data Analytics for Web and Text), taught in Spring 2014 by Xiaozhong Liu and Miao Chen.

Abstract

In large-scale digital libraries, it is not uncommon for some bibliographic fields in metadata records to be incomplete or missing. Filling in incomplete or missing metadata can greatly facilitate users' search for and access to digital library resources. Temporal information, such as publication date, is a key descriptor of digital resources. In this study, we investigate text mining methods to automatically resolve missing publication dates for the HathiTrust corpora, a large collection of documents digitized by optical character recognition (OCR). In comparison with previous approaches using only unigrams as features, our experimental results show that methods incorporating higher-order n-gram features, e.g., bigrams and trigrams, can more effectively classify a document into discrete temporal intervals or "chronons". Our approach can be generalized to classify volumes within other digital libraries.
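To make the idea in the abstract concrete, here is a minimal, standard-library-only sketch of classifying text into chronons using unigram, bigram, and trigram counts. It is an illustration under stated assumptions, not the code from this repository or the method evaluated in the paper: the names `ngram_features`, `chronon`, and `NaiveBayes`, the 50-year interval width, the 1700 start year, the choice of a naive Bayes classifier, and the two toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, max_n=3):
    """Count word n-grams of order 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def chronon(year, start=1700, width=50):
    """Map a year to an interval label such as '1850-1899'.

    The start year and interval width are assumptions for this sketch.
    """
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an illustrative choice)."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            denom = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for gram, k in feats.items():
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += k * math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for this example: (text, publication year).
toy_corpus = [
    ("thou art a knave", 1720),
    ("the telegraph office is open", 1860),
]
model = NaiveBayes().fit(
    [ngram_features(text) for text, _ in toy_corpus],
    [chronon(year) for _, year in toy_corpus],
)
print(model.predict(ngram_features("thou art")))
```

Setting `max_n=1` reduces the features to unigrams only, which gives a simple way to compare against the unigram baseline the abstract mentions.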

Team