xpmethod / opensyllabus


citation detection against citeseer / DPCP #40

Open denten opened 10 years ago

denten commented 10 years ago

citation-detection

Search for each citation in the CiteSeer DB (20 million) through Solr to find a list of candidate documents that contain the citation.
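A rough sketch of what one such lookup could look like, assuming a Solr core reachable over HTTP. The base URL, the core name `citeseer`, and the field name `text` are placeholders, not the project's actual setup. The sketch only builds the request URL for Solr's standard `/select` handler and does not send it.

```python
from urllib.parse import urlencode


def escape_phrase(text):
    """Escape backslashes and double quotes so text can sit inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_citation_query(citation,
                         base_url="http://localhost:8983/solr/citeseer",  # hypothetical
                         field="text",                                    # hypothetical
                         rows=10):
    """Return a /select URL that searches `field` for `citation` as a phrase."""
    params = {
        "q": '{}:"{}"'.format(field, escape_phrase(citation)),
        "rows": rows,   # cap the list of candidate documents
        "wt": "json",   # ask Solr for a JSON response
    }
    return base_url + "/select?" + urlencode(params)


url = build_citation_query("Lady Sings the Blues")
print(url)
```

An exact phrase match is probably too strict for noisy syllabus text, so a real run would likely need a looser query; that choice is left open here.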

karaganis commented 10 years ago

If this works, run against other title libraries (Worldcat?)

Joe Karaganis

The American Assembly Columbia University


grahamsack commented 10 years ago

I was going to suggest something similar. For many syllabi, given the paucity of citation info, I think our only means of identifying books will be to check possible titles and authors against a larger database like WorldCat or the Library of Congress, and then see whether the subject headings of the matching books seem plausible given the discipline of the syllabus.
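To make the subject-heading check concrete, here is a minimal sketch. It assumes catalog candidates have already been fetched as dicts with `title` and `subjects` keys; that shape, the scoring rule, the threshold, and the sample records are all invented for illustration and do not come from WorldCat or the Library of Congress.

```python
def plausibility(subject_headings, discipline_terms):
    """Fraction of a record's subject headings that mention any discipline term."""
    if not subject_headings:
        return 0.0
    terms = [t.lower() for t in discipline_terms]
    hits = sum(1 for h in subject_headings
               if any(t in h.lower() for t in terms))
    return hits / len(subject_headings)


def best_match(candidates, discipline_terms, threshold=0.5):
    """Return the candidate whose headings best fit the discipline, or None."""
    if not candidates:
        return None
    scored = [(plausibility(c["subjects"], discipline_terms), c) for c in candidates]
    score, record = max(scored, key=lambda pair: pair[0])
    return record if score >= threshold else None


# Invented records for an ambiguous title found on a music syllabus.
candidates = [
    {"title": "Vision", "subjects": ["Optics", "Visual perception"]},
    {"title": "Vision", "subjects": ["Women musicians", "Music history"]},
]
match = best_match(candidates, ["music"])
print(match)
```

Substring matching on headings is crude; it is only meant to show where the discipline of the syllabus enters the decision.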


denten commented 10 years ago

Let's start with easier fields like "discipline" and "university". I am assigning this to the .3 milestone, to be started in September.

cosmicBboy commented 10 years ago

One avenue I've been exploring is looking at the Bold and Italic style tags in the HTML format.

Take _Leonard_Women_Music_syllabus.html in the /extractorresearch/extractors/output folder, for example.

Grabbing elements with an 'Italic' style value yields this:

[The Source, Women & Music, Chicago Manual of Style, Women in Music, Vision, Women Making Music, Jougleresses, Trobairitz, Women Making Music , Slingshot Hip Hop, Jericho's Echo: Punk Rock in the Holy Land, Lady Sings the Blues, Girls Rock]

The elements that follow each title contain author and required chapter information.

Searching for Bold yields this:

[Course Objectives, Email, Academic Code of Conduct, Required Text/Materials, Style Manual, Technology Requirements, Attendance, Students with Disabilities, Assignment Policies, Assignments and Evaluation, Participation:20%, Blogging, 15%, Omeka exhibit items: 25%, Final Project: 40%, Elements of the final project, Classroom Etiquette, Course Schedule]
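For reference, a minimal sketch of this kind of style-based extraction using only Python's standard `html.parser`. It assumes the converted HTML marks emphasis through a `style` attribute whose value contains "Italic" or "Bold" (for example in a font name). The sample markup below is made up, and the real extractor output may be structured differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StyledTextCollector(HTMLParser):
    """Collect the text of elements whose style attribute mentions `keyword`."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").lower()
        if self.depth or self.keyword in style:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse whitespace and keep non-empty text only.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.results.append(text)
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def styled_text(html, keyword):
    parser = StyledTextCollector(keyword)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup; the real files may encode styles differently.
SAMPLE = ('<p><span style="font-family: Times-Italic">Lady Sings the Blues</span>, '
          'chapters 1-3</p>'
          '<p><span style="font-family: Times-Bold">Course Schedule</span></p>')

print(styled_text(SAMPLE, "italic"))
print(styled_text(SAMPLE, "bold"))
```

Matching on the style string rather than on `<i>` or `<b>` tags mirrors the approach described above, at the cost of depending on how the converter names its fonts.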