zachguo / TCoHOT

Temporal Classification of HathiTrust OCRed Texts (codes for paper published in iConf 2015)
http://hdl.handle.net/2142/73656

March 5th Digital Library Brown Bag HTRC #9

tedelblu closed 10 years ago

tedelblu commented 10 years ago

Greetings! Please join us this coming Wednesday, March 5, 2014, for a Digital Library Brown Bag Series presentation by Miao Chen. The presentation will be held at the Herman B Wells Library, room E174, from 12:00-1:00 EST. The presentation will also be broadcast (see details below).

HathiTrust Research Center: Challenges and Opportunities in Big Text Data
Miao Chen, Research Associate, Data to Insight Center

HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of November 2013, the HathiTrust collection held 10.8M volumes in total, of which 3.5M are in the public domain [1] and the rest are in copyright.

The public domain volumes of the HathiTrust collection alone take up more than 2TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file describing the provenance of the digital object. The scale of the text therefore raises challenges for computational access to the collection, subsets of the collection, and the metadata. It also poses a challenge for text mining: how can HTRC provide algorithms that exploit the knowledge in the collections and accommodate various mining needs? In this workshop, we will introduce the HTRC infrastructure, the portal and work set builder interface, and the programmatic data retrieval API (Data API), discuss the challenges and opportunities in HTRC big text data, and finish with a short demo of the HTRC tools.

More about HTRC

The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. It helps researchers meet the technical challenges of working with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details.

[1] http://www.hathitrust.org/statistics_visualizations

Presentations will also be broadcast via Adobe Connect. Go to http://breeze.iu.edu/diglib to view and listen to the presentation. If you are not a registered user for Connect Meeting/Breeze, select the "Enter as a Guest" option.

You can also follow and contribute to the presentation and discussion on Twitter: #dlbb.

bindai commented 10 years ago

Cool! Thanks for sharing. I think the professor will let us out early to attend that presentation.

Best, Bin