zachguo / TCoHOT

Temporal Classification of HathiTrust OCRed Texts (codes for paper published in iConf 2015)
http://hdl.handle.net/2142/73656

Ted Underwood to speak at IU today! (4/7) #39

Closed tedelblu closed 10 years ago

tedelblu commented 10 years ago

All,

Ted Underwood will be presenting in Woodburn Hall 200 today at 4:00pm. I am going to try to attend and see whether he has time to answer a few questions about his time-based OCR error normalization project.

I encourage you to join me if you are available.


chenmiao commented 10 years ago

That's wonderful!

Miao

PallaviMurthy commented 10 years ago

Wow, that's great! But I am working today until 6:00pm. Trevor, let me know how it was if you get to attend.

chenmiao commented 10 years ago

Dear All,

In case you don't know, Ted Underwood's presentation will be broadcast online too.

Miao


Thank you for your interest. Yes, we plan to broadcast it live here:

http://www.indiana.edu/~video/stream/liveflash.html?filename=Catapult_Colloquium_Series

Also, we'll post a recording of it in a few days here:

http://www.indiana.edu/~catapult/colloquia_past.shtml

tedelblu commented 10 years ago

All,

Zach and I were fortunate enough to attend Ted's presentation. Here is a short summary:

Professor Underwood framed the discussion in terms of text mining and machine learning as new opportunities in research for humanists. He stressed that there is not merely the opportunity to collaborate with computer scientists and statisticians, but a real opportunity to introduce these research methods into the discipline.

He provided a very cursory overview of supervised and unsupervised learning, Bayesian statistics, and topic modeling. He demonstrated how topic modeling could reveal time-based clustering of topics using the full HathiTrust corpus (not the Sandbox). Unfortunately, he did not go into the specifics of his methods, but he did mention that the research he presented was based solely on the metadata (MARC XML).

I had an opportunity to speak with Ted after his talk. One of our team's questions had to do specifically with "genre" in the metadata. He answered this question for us by letting me know that he parsed this from data element 008.

... ...

If we are interested in using "genre" as a feature, it looks like place 33 of element 008 represents the literary form.

Unfortunately, there was a long line of folks wanting to ask questions, so I didn't get to ask about OCR error normalization, but he was kind enough to offer that I might email him any additional questions I had. This is a generous offer that we should keep in our back pocket if we find that we have time to explore additional features from the metadata.

Related links:
http://www.hathitrust.org/bib_specifications
http://www.loc.gov/marc/bibliographic/ecbdlist.html
http://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/

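
As a rough illustration of reading position 33 of field 008, here is a minimal sketch in Python. It assumes the records are MARCXML in the standard `http://www.loc.gov/MARC21/slim` namespace and that they describe books, since the meaning of position 33 depends on the material type. The helper names `literary_form` and `make_sample_record` and the sample record are made up for this example; this is not the code Ted used.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace (assumed to match our metadata files).
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Codes for 008 position 33 (literary form) in MARC 21 book records.
LITERARY_FORM = {
    "0": "not fiction",
    "1": "fiction",
    "d": "dramas",
    "e": "essays",
    "f": "novels",
    "h": "humor, satires, etc.",
    "i": "letters",
    "j": "short stories",
    "m": "mixed forms",
    "p": "poetry",
    "s": "speeches",
    "u": "unknown",
    "|": "no attempt to code",
}


def literary_form(record_xml):
    """Return (code, label) for 008/33, or None if the field is missing or too short."""
    root = ET.fromstring(record_xml)
    # Works whether the root is a single record or a collection of records
    # (in the latter case only the first 008 field is read).
    field = root.find(".//marc:controlfield[@tag='008']", MARC_NS)
    if field is None or field.text is None or len(field.text) <= 33:
        return None
    code = field.text[33]  # MARC positions are zero-based, like Python indexes
    return code, LITERARY_FORM.get(code, "unrecognized code")


def make_sample_record(form_code):
    """Build a made-up MARCXML record whose 008 has form_code at position 33."""
    chars = ["|"] * 40  # 008 is 40 characters long; '|' means "no attempt to code"
    chars[33] = form_code
    return (
        '<record xmlns="http://www.loc.gov/MARC21/slim">'
        '<controlfield tag="008">' + "".join(chars) + "</controlfield>"
        "</record>"
    )


print(literary_form(make_sample_record("p")))
```

If we go this route, it would be worth counting how many of our records actually carry a meaningful code there, since "unknown" and "no attempt to code" would not help as a feature.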
zhhuo commented 10 years ago

One question about the confusion matrix: will the table only show true or false (positive or negative) as a result? Is that correct?

If so, what is the positive value and what is the negative value when we apply it to our data?
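
To make the question concrete, here is a minimal sketch of one common reading: with more than two classes, each cell of the confusion matrix counts documents with true class i that were predicted as class j, and "positive" and "negative" only make sense once a single class is picked and compared against all the others. The time-period labels and the predictions below are invented for illustration and are not our project's results.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def one_vs_rest(matrix, classes, positive):
    """Collapse a multi-class matrix into TP/FP/FN/TN for one chosen class."""
    k = classes.index(positive)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                         # true class k, predicted k
    fn = sum(matrix[k]) - tp                  # true class k, predicted something else
    fp = sum(row[k] for row in matrix) - tp   # other true class, predicted k
    tn = total - tp - fn - fp                 # everything not involving class k
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical time-period classes and made-up predictions.
classes = ["1800-1849", "1850-1899", "1900-1949"]
true = ["1800-1849", "1800-1849", "1850-1899", "1850-1899", "1900-1949", "1900-1949"]
pred = ["1800-1849", "1850-1899", "1850-1899", "1850-1899", "1900-1949", "1850-1899"]

matrix = confusion_matrix(true, pred, classes)
for label, row in zip(classes, matrix):
    print(label, row)
print(one_vs_rest(matrix, classes, "1850-1899"))
```

Under this reading, the positive class is whichever period we are asking about at the moment, and every other period counts as negative.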


From: tedelblu notifications@github.com Sent: Tuesday, April 8, 2014 7:54 AM To: zachguo/Z604-Project Subject: Re: [Z604-Project] Ted Underwood to speak at IU today! (4/7) (#39)


PallaviMurthy commented 10 years ago

Thank you, Trevor. That's great that you got to speak to Ted.

P.S. I am in a project meeting for my other course until 10pm (it might take longer). I will work on Z604 after the meeting ends and update the slides etc.

On Tue, Apr 8, 2014 at 4:57 PM, zhhuo notifications@github.com wrote:

One question about the confusion matrix: will the table only show true or false (positive or negative) as a result? Is that correct?

If so, what is the positive value and what is the negative value when we apply it to our data?



Master of Information Science (MIS) School of Informatics and Computing