A script for downloading the XML has been added to the repo; just git pull.
Following are some basic statistics, kept for reference. For aa, almost all extracted documents have 'date' information; only three documents are missing the date entirely:
> db.dv.find().count()
18718
> db.dv.find({$or:[{"date":""},{"date":{$exists:0}}]}).count();
0
> db.dv.find({"date":{$regex:/^\d/}}).count()
18715
> db.dv.find({"date":{$regex:/^\d{2}/}}).count()
18703
> db.dv.find({"date":{$regex:/^\d{3}/}}).count()
18636
> db.dv.find({"date":{$regex:/^\d{4}/}}).count()
18624
> db.dv.find({"date":{$regex:/^\D/}}).count()
3
> db.dv.find({"date":{$regex:/^\D/}})
{ "_id" : "uiuo.ark+=13960=t78s5123s", "date" : "uuuuuuuu" }
{ "_id" : "uiuo.ark+=13960=t85h7t84t", "date" : "uuuuuuuu" }
{ "_id" : "uiuo.ark+=13960=t46q27p0m", "date" : "uuuuuuuu" }
> db.dv.find({"date":{$regex:/^\d\D/}}).count()
12
> db.dv.find({"date":{$regex:/^\d\D/}})
{ "_id" : "mdp.39015066581995", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068158610", "date" : "1uuu9999" }
{ "_id" : "mdp.39015066660336", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068379182", "date" : "1uuu9999" }
{ "_id" : "yale.39002013858791", "date" : "1uuuuuuu" }
{ "_id" : "mdp.39015056713996", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252454", "date" : "1uuu9999" }
{ "_id" : "mdp.39015065217096", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252587", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068084105", "date" : "1uuu9999" }
{ "_id" : "mdp.39015068252470", "date" : "1uuu9999" }
{ "_id" : "mdp.39015066000418", "date" : "1uuu9999" }
> db.dv.find({"date":{$regex:/^\d{2}\D/}}).count()
67
> db.dv.find({"date":{$regex:/^\d{2}\D/}}).limit(5)
{ "_id" : "mdp.39015068491748", "date" : "18uu9999" }
{ "_id" : "mdp.39015028302308", "date" : "19uu1975" }
{ "_id" : "mdp.39015021073930", "date" : "18uu9999" }
{ "_id" : "mdp.39015068187189", "date" : "18uu9999" }
{ "_id" : "mdp.39015068187197", "date" : "18uu9999" }
Do you think I should re-sample? I am wondering if two samples should be taken--one from volumes with dates, and one from volumes with missing or "invalid" dates. We could perform statistical analysis on them separately and/or concatenate the two samples and run the analysis on the combined set.
Thoughts?
Let me know. The data capacitor just came back online, so I have access to the data again.
@tedelblu
I am wondering if two samples should be taken--one from volumes with dates, and one from volumes with missing or "invalid" dates.
Yes, I think we first need to divide the volumes into 3 parts/files:

- ones with valid dates,
- ones with partial dates (e.g. '18uu', '1uuu'),
- ones with missing or "invalid" dates.

Then we'll split/sample the first file (volumes with valid dates) into training and testing sets. A rough sketch of the queries is below.
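Something like the following could do the three-way split. This is a minimal pymongo sketch based on the regexes in the shell session above; it assumes the dv collection shown there, a database named test, and that "valid" simply means the date begins with four digits:

```python
# Sketch only: the database name and the definition of "valid" are assumptions.
from pymongo import MongoClient

db = MongoClient()["test"]  # the thread only names the 'dv' collection, not the DB

groups = {
    "valid":   {"date": {"$regex": "^\\d{4}"}},         # e.g. '18751875'
    "partial": {"date": {"$regex": "^\\d{1,3}\\D"}},    # e.g. '18uu9999', '1uuu9999'
    "invalid": {"$or": [{"date": ""},
                        {"date": {"$exists": False}},
                        {"date": {"$regex": "^\\D"}}]}, # e.g. 'uuuuuuuu' or missing
}

for name, query in groups.items():
    print(name, db.dv.count_documents(query))
```

The same three queries could then be used to dump each group of _ids into its own file before sampling.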
I went back through the vidsplit_aa collection, and here is what I found:
- 19,915 records
- 18,718 lang=eng
- 18,623 valid date or range
- 95 partial, invalid, or missing dates
I will post two new files based on this--one random sample of valid date/range records, and one random sample of invalid dates.
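One possible way to draw those two samples is sketched below; the sample size, file names, and the valid-date regex are placeholders of mine, not anything agreed in this thread:

```python
# Sketch: sample _ids from the valid-date group and keep all records with a
# partial, invalid, or missing date. Sizes and filenames are placeholders.
import json
import random
from pymongo import MongoClient

db = MongoClient()["test"]  # DB name is an assumption

valid_ids = [d["_id"] for d in db.dv.find({"date": {"$regex": "^\\d{4}"}}, {"_id": 1})]
bad_ids   = [d["_id"] for d in db.dv.find({"$or": [{"date": ""},
                                                   {"date": {"$exists": False}},
                                                   {"date": {"$regex": "^\\D"}},
                                                   {"date": {"$regex": "^\\d{1,3}\\D"}}]},
                                          {"_id": 1})]

with open("sample_valid_ids.json", "w") as f:
    json.dump(random.sample(valid_ids, min(1000, len(valid_ids))), f)
with open("sample_invalid_ids.json", "w") as f:
    json.dump(bad_ids, f)  # only ~95 of these, so keep them all
```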
Trevor
Assigned to @bindai
The complete HTRC metadata can be retrieved from the API described here. (The "metadata specification" section may also be helpful when parsing the XML.)
Please complete the xml2json_HTRC function in metadata_processing/xml2json.py to parse the new XML metadata into JSON in MongoDB's default format. Then please create a getDV_HTRC.py to dump the JSON into MongoDB and extract the dependent variable. Any accompanying new documentation should go here. Thanks!
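For whoever picks this up, here is a rough sketch of the shape this could take. The XML element names ('record', 'volumeIdentifier', 'pubDate') are placeholders to be replaced with the real ones from the metadata specification, and only the 'date' dependent variable discussed above is pulled out:

```python
# metadata_processing/xml2json.py -- sketch only. Element names below are
# placeholders; substitute the actual names from the HTRC metadata specification.
import xml.etree.ElementTree as ET
from pymongo import MongoClient

def xml2json_HTRC(xml_string):
    """Parse one HTRC metadata XML document into MongoDB-style dicts."""
    root = ET.fromstring(xml_string)
    docs = []
    for record in root.iter("record"):
        vol_id = record.findtext("volumeIdentifier", default="")
        date = record.findtext("pubDate", default="")
        # Using the volume id as _id lets re-imports overwrite rather than duplicate.
        docs.append({"_id": vol_id, "date": date})
    return docs

def dump_to_mongo(docs, db_name="test", coll_name="dv"):
    """Upsert parsed docs; the database/collection names are assumptions."""
    coll = MongoClient()[db_name][coll_name]
    for doc in docs:
        coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        dump_to_mongo(xml2json_HTRC(f.read()))
```

getDV_HTRC.py would then mostly be the dump_to_mongo step plus whatever extra cleanup of the date field we decide on.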