zachguo / TCoHOT

Temporal Classification of HathiTrust OCRed Texts (code for the paper published at iConf 2015)
http://hdl.handle.net/2142/73656

Parse and extract doc_id & publication_date from HTRC MARC 21 XML metadata #12

Closed: zachguo closed this issue 10 years ago

zachguo commented 10 years ago

Assigned to @bindai

The complete HTRC metadata can be retrieved from the API described here. (The "metadata specification" section may also be helpful when parsing the XML.)

Please complete the xml2json_HTRC function in metadata_processing/xml2json.py to parse the new XML metadata into JSON in MongoDB's default document format. Then please create a getDV_HTRC.py that dumps the JSON into MongoDB and extracts the dependent variable. Any accompanying new documentation should go here.
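For reference, a minimal sketch of what xml2json_HTRC might look like. The 008 field positions follow the MARC 21 bibliographic spec, but the function signature, the MARCXML namespace handling, and the source of doc_id are assumptions, not the project's actual interface:

# Hedged sketch, not the project's actual implementation.
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def xml2json_HTRC(xml_string, doc_id):
    """Parse one MARC 21 XML record into a MongoDB-style document."""
    root = ET.fromstring(xml_string)
    date = ""
    for cf in root.iter(MARC_NS + "controlfield"):
        if cf.get("tag") == "008" and cf.text:
            # 008/07-10 is Date 1, 008/11-14 is Date 2 ('u' = unknown digit)
            date = cf.text[7:15]
            break
    # MongoDB's default primary key field is _id
    return {"_id": doc_id, "date": date}

getDV_HTRC.py could then feed these dicts straight into a pymongo insert.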

Thanks!!!!!!

zachguo commented 10 years ago

A script for downloading the XML has been added to the repo; just git pull.

zachguo commented 10 years ago

Following are some basic statistics, kept for reference. For the aa split, almost all extracted documents have 'date' information; only three documents lack a date:

> db.dv.find().count()
18718
> db.dv.find({$or:[{"date":""},{"date":{$exists:0}}]}).count();
0
> db.dv.find({"date":{$regex:/^\d/}}).count()
18715
> db.dv.find({"date":{$regex:/^\d{2}/}}).count()
18703
> db.dv.find({"date":{$regex:/^\d{3}/}}).count()
18636
> db.dv.find({"date":{$regex:/^\d{4}/}}).count()
18624
> db.dv.find({"date":{$regex:/^\D/}}).count()
3
> db.dv.find({"date":{$regex:/^\D/}})
{ "_id" : "uiuo.ark+=13960=t78s5123s", "date" : "uuuuuuuu" }
{ "_id" : "uiuo.ark+=13960=t85h7t84t", "date" : "uuuuuuuu" }
{ "_id" : "uiuo.ark+=13960=t46q27p0m", "date" : "uuuuuuuu" }
> db.dv.find({"date":{$regex:/^\d\D/}}).count()
12
> db.dv.find({"date":{$regex:/^\d\D/}})
{ "_id" : "mdp.39015066581995", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068158610", "date" : "1uuu9999" }
{ "_id" : "mdp.39015066660336", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068379182", "date" : "1uuu9999" }
{ "_id" : "yale.39002013858791", "date" : "1uuuuuuu" }
{ "_id" : "mdp.39015056713996", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252454", "date" : "1uuu9999" }
{ "_id" : "mdp.39015065217096", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252587", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068084105", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252470", "date" : "1uuu9999" }
{ "_id" : "mdp.39015066000418", "date" : "1uuu9999" }
> db.dv.find({"date":{$regex:/^\d{2}\D/}}).count()
67
> db.dv.find({"date":{$regex:/^\d{2}\D/}}).limit(5)
{ "_id" : "mdp.39015068491748", "date" : "18uu9999" }
{ "_id" : "mdp.39015028302308", "date" : "19uu1975" }
{ "_id" : "mdp.39015021073930", "date" : "18uu9999" }
{ "_id" : "mdp.39015068187189", "date" : "18uu9999" }
{ "_id" : "mdp.39015068187197", "date" : "18uu9999" }
tedelblu commented 10 years ago

Do you think I should re-sample? I am wondering if two samples should be taken: one from volumes with dates, and one from volumes with missing or "invalid" dates. We could perform statistical analysis on them separately and/or concatenate the two samples and perform the analysis.

Thoughts?

Let me know. The Data Capacitor just came back online, so I have access to the data again.


zachguo commented 10 years ago

@tedelblu

I am wondering if two samples should be taken: one from volumes with dates, and one from volumes with missing or "invalid" dates.

Yes, I think we first need to divide the volumes into 3 parts/files:

  • ones with valid dates,
  • ones with partial dates (e.g. '18uu', '1uuu'),
  • ones with missing or "invalid" dates.

Then we'll split/sample the first file (volumes with valid dates) into training and testing sets.
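A minimal sketch of that 3-way split; the bucket rules are inferred from the date strings in the statistics above, and the function and label names are mine, not the project's:

# Hedged sketch of the proposed valid / partial / invalid split.
import re

def date_bucket(date):
    """Classify an 008-style date string."""
    if date and re.match(r"\d{4}", date):
        return "valid"    # e.g. '19751975'
    if date and re.match(r"\d{1,3}u", date):
        return "partial"  # e.g. '18uu9999', '1uuu9999'
    return "invalid"      # e.g. 'uuuuuuuu', empty, or missing

assert date_bucket("18uu9999") == "partial"
assert date_bucket("uuuuuuuu") == "invalid"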

tedelblu commented 10 years ago

I went back through the vidsplit_aa collection, and here is what I found:

  • 19,915 records
  • 18,718 lang=eng
  • 18,623 valid date or range
  • 95 partial, invalid, or missing date

I will post two new files based on this: one random sample of valid date/range records, and one random sample of invalid dates.
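A placeholder sketch of that sampling step, assuming the volume ids are already split into the two groups; the ids, sample sizes, and seed below are illustrative, not the actual vidsplit_aa data:

# Draw reproducible random samples of volume ids (all values placeholders).
import random

def sample_ids(ids, n, seed=604):
    """Draw a reproducible random sample of up to n volume ids."""
    return random.Random(seed).sample(ids, min(n, len(ids)))

valid_ids   = ["mdp.39015068491748", "yale.39002013858791"]  # placeholder ids
invalid_ids = ["uiuo.ark+=13960=t78s5123s"]                  # placeholder ids
print(sample_ids(valid_ids, 1), sample_ids(invalid_ids, 1))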

