zachguo / TCoHOT

Temporal Classification of HathiTrust OCRed Texts (codes for paper published in iConf 2015)
http://hdl.handle.net/2142/73656

Determine date ranges #16

Closed by zachguo 10 years ago

zachguo commented 10 years ago

After completing #12, the date categories for the dependent variable should be determined according to the distribution of dates and historical knowledge.

tedelblu commented 10 years ago

Hi Zach,

Can you please clarify what we want here? Do we want the date range for the entire XML corpus (excluding non-English volumes, invalid dates, and missing dates), or should I just focus on the vidsplit_aa volumes?

zachguo commented 10 years ago

For now we can focus on valid dates of English documents in aa. We don't want any one date range to contain too many documents (e.g., imagine an extreme case where 1850-1950 holds 99% of all documents: a model could reach roughly 99% accuracy just by classifying every test document into that range), or too few documents (so little training data that the model can never get it right). But I'm not sure whether it's necessary to make each date range contain the same number of documents. There may also be historical reasons for choosing boundaries (e.g., World War I, the Civil War).
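A quick way to sanity-check this concern is to compute the majority-class baseline for a candidate set of ranges: the accuracy a trivial classifier gets by always predicting the most populated range. A minimal sketch (the counts below are hypothetical placeholders, not our real data):

```python
from collections import Counter

# Hypothetical date-range counts; replace with the real distribution from the metadata.
range_counts = Counter({"pre-1800": 120, "1800-1850": 900, "1850-1950": 45000, "post-1950": 300})

total = sum(range_counts.values())
majority_range, majority_count = range_counts.most_common(1)[0]

# Accuracy of always predicting the most frequent range.
baseline_accuracy = majority_count / total
print(f"Majority baseline: always predict '{majority_range}' -> {baseline_accuracy:.1%} accuracy")
```

Any classifier we build should beat this number by a clear margin for the ranges to be meaningful.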

tedelblu commented 10 years ago

I posted the date distribution of the training data and a visual plot to the processed data section of the wiki. I will upload the .sh and .r code I used as well. The distribution is not ideal (~90% of texts fall within a single 100-year period), but I think we are okay to proceed.
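For reference, a plot like this can be reproduced with something along the following lines (a hypothetical Python sketch, not the actual .sh/.r code; it assumes a plain-text file with one publication year per line extracted from the metadata):

```python
import matplotlib.pyplot as plt

# Hypothetical input file: one four-digit publication year per line.
with open("aa_dates.txt") as f:
    years = [int(line.strip()) for line in f if line.strip().isdigit()]

# Histogram with decade-width bins.
plt.hist(years, bins=range(min(years), max(years) + 10, 10))
plt.xlabel("Publication year")
plt.ylabel("Number of volumes")
plt.title("Date distribution of the aa split")
plt.savefig("date_range_plot.png")
```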

zachguo commented 10 years ago

Thanks! The distribution looks really odd. It looks like a normal distribution cut off at ~1925, so we probably cannot simply use aa as the training set. Could you please make another plot for the whole metadata?

tedelblu commented 10 years ago

Yes, I can do that, but I probably won't get to it until later tomorrow night. I hope we see something different and that the aa split is not representative of the corpus.

zachguo commented 10 years ago

No problem, take your time. If the distribution of dates from the whole corpus is acceptable, we may have to sample from the whole corpus and create our own vid_split file.
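If we do end up sampling ourselves, one option is a capped, range-stratified sample so that no single range dominates. A sketch under assumed file names and layout (tab-separated "volume_id, year" metadata; the cap and decade-based ranges are placeholders):

```python
import random
from collections import defaultdict

# Hypothetical metadata file: "volume_id<TAB>year" per line.
by_range = defaultdict(list)
with open("metadata_dates.tsv") as f:
    for line in f:
        vid, year = line.rstrip("\n").split("\t")
        decade = (int(year) // 10) * 10   # crude range: decade of publication
        by_range[decade].append(vid)

# Cap every range at the same number of volumes to flatten the distribution.
cap = 500
random.seed(604)
with open("vid_split_custom.txt", "w") as out:
    for decade, vids in sorted(by_range.items()):
        for vid in random.sample(vids, min(cap, len(vids))):
            out.write(f"{vid}\n")
```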

zachguo commented 10 years ago

And we need categorical date ranges (e.g. pre17thC, 1910-1930, etc.)
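Concretely, the dependent variable could be produced by a small mapping from year to label, along these lines (the cut points below are illustrative placeholders, not a decision):

```python
def date_range_label(year: int) -> str:
    """Map a publication year to a categorical date-range label (placeholder cut points)."""
    if year < 1600:
        return "pre17thC"
    if year < 1800:
        return "1600-1799"
    if year < 1850:
        return "1800-1849"
    if year < 1910:
        return "1850-1909"
    if year <= 1930:
        return "1910-1930"
    return "post-1930"

print(date_range_label(1923))  # -> "1910-1930"
```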

zachguo commented 10 years ago

The distribution of all dates is quite similar to that of aa.

all: alldatefreq_plot (https://f.cloud.github.com/assets/3478203/2364504/f8fb5908-a678-11e3-974a-8d2380ce3604.png) · aa: date_range_plot (https://f.cloud.github.com/assets/3478203/2364519/b81873ac-a679-11e3-9977-a1c6008bec1a.jpg)

So we can continue using the aa part as our working sample. But the distribution makes it difficult to divide the dates into reasonable categorical ranges.
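One way to get usable ranges despite the skew is equal-frequency binning, letting the data choose the cut points so every range holds roughly the same number of volumes. A sketch assuming pandas is available (the number of bins and the sample years are arbitrary):

```python
import pandas as pd

# Hypothetical series of publication years from the aa split.
years = pd.Series([1803, 1856, 1871, 1888, 1895, 1902, 1910, 1918, 1921, 1923])

# qcut picks cut points so each bin holds roughly the same number of documents.
labels, bins = pd.qcut(years, q=4, retbins=True, duplicates="drop")
print(bins)                    # data-driven cut points
print(labels.value_counts())   # roughly balanced bin sizes
```

The trade-off is that data-driven cut points may not align with historically meaningful boundaries, which is the other criterion raised above.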

tedelblu commented 10 years ago

Hi Zach,

Thanks for doing this. This is a little disconcerting because it may nullify our research question. Given that all of our documents fall within a very short date range (1800-1900), our ranges are going to need to be quite small, i.e., decades within 1800-1900. Also, simply predicting the most likely date range from the corpus distribution may be more accurate than the topic model's predictions of the date range.

I really hope this isn't the case, but I think we need to make this discussion a priority when we meet tomorrow morning. In the meantime, I will try and tackle defining a date range.

tedelblu commented 10 years ago

Posted date frequency ranges (.txt & .png) to the processed data section of the wiki.