Closed: zachguo closed this issue 10 years ago
Hi Zach,
Can you please clarify what we want here? Do we want the date range for the entire XML corpus (excluding non-English, invalid dates, and missing dates), or should I just focus on the vidsplit_aa volumes?
For now, we can focus on valid dates of English documents in aa.
We don't want any one date range to have too many documents (e.g., imagine an extreme case where 1850-1950 contains 99% of all documents; the model could get approx. 99% accuracy just by classifying all test data into that range), or too few documents (so little training data that the model can never get it right). But I'm not sure whether it's necessary for each date range to contain the same number of documents.
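To make the imbalance concern concrete, here is a small sketch (with hypothetical counts, not our actual corpus figures) showing how a skewed date-range distribution inflates the trivial majority-class baseline:

```python
from collections import Counter

# Hypothetical document counts per date range (placeholder numbers,
# mirroring the imagined extreme case where ~99% fall in one range).
range_counts = Counter({
    "pre-1800": 5,
    "1800-1850": 20,
    "1850-1950": 2475,
})

total = sum(range_counts.values())
majority_range, majority_count = range_counts.most_common(1)[0]

# A classifier that always predicts the majority range gets this accuracy
# without learning anything, so our model must beat it to be meaningful.
baseline_accuracy = majority_count / total
print(f"Always predict {majority_range!r}: {baseline_accuracy:.1%} accuracy")
```

Any model we report should be compared against this baseline, not just raw accuracy.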
There may also be historical reasons for the boundary choices (e.g. World War I, the Civil War).
I posted the date distribution of the training data and a visual plot to the processed data section of the wiki. I will upload the .sh and .r code I used as well. The distribution is not ideal (~90% of texts fall within a single 100-year period), but I think we are okay to proceed.
Thanks!
The distribution looks really strange. It looks like a normal distribution cut off at ~1925. So we probably cannot simply use aa as the training set.
Could you please make another plot for the whole metadata?
Yes, I can do that, but I probably won't get to it until late tomorrow night. I hope we see something different, and that the aa split is not representative of the corpus.
No problem, take your time.
If the distribution of dates from the whole corpus is acceptable, we may have to sample from the whole corpus and create our own vid_split file.
And we need categorical date ranges (e.g. pre17thC, 1910-1930, etc.)
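As a sketch of what such a categorical mapping could look like (the cut points and labels below are placeholders, not our final choices, which should come from the distribution and historical events):

```python
import bisect

# Placeholder boundaries: each value is the inclusive lower bound of the
# next bucket. The real cut points are still to be decided (see #12).
boundaries = [1700, 1800, 1850, 1910, 1931]
labels = ["pre18thC", "1700-1799", "1800-1849", "1850-1909", "1910-1930", "post1930"]

def date_range(year):
    """Map a publication year to a categorical date-range label."""
    return labels[bisect.bisect_right(boundaries, year)]

print(date_range(1655))  # -> pre18thC
print(date_range(1925))  # -> 1910-1930
```

Using `bisect_right` keeps the mapping O(log n) per lookup and makes the boundary semantics explicit (a boundary year starts a new bucket).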
The distribution of all dates is quite similar to that of aa.
all: [alldatefreq_plot] https://f.cloud.github.com/assets/3478203/2364504/f8fb5908-a678-11e3-974a-8d2380ce3604.png
aa: [date_range_plot] https://f.cloud.github.com/assets/3478203/2364519/b81873ac-a679-11e3-9977-a1c6008bec1a.jpg
So we can continue using the aa part as our working sample. But the distribution makes it difficult to divide those dates into reasonable categorical ranges.
Hi Zach,
Thanks for doing this. This is a little disconcerting because it may nullify our research question. Given that nearly all of our documents fall within a very short date range (1800-1900), our ranges are going to need to be quite small, i.e. decades within 1800-1900. Also, the statistical likelihood that a date falls within the dominant range of the corpus may be higher than the reliability of the topic model at predicting the date range.
I really hope this isn't the case, but I think we need to make this discussion a priority when we meet tomorrow morning. In the meantime, I will try to tackle defining a date range.
Posted the date frequency ranges (.txt & .png) to the processed data section of the wiki.
After completing #12, the date categories for the dependent variable should be determined, based on the distribution of dates and historical knowledge.
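One data-driven starting point (a sketch, assuming we have the valid years as a plain Python list pulled from the metadata) is equal-frequency binning: cut at quantiles so each category gets roughly the same number of documents, then nudge the boundaries toward historically meaningful years by hand.

```python
def quantile_boundaries(years, n_bins):
    """Return cut points that split the years into n_bins equal-count bins."""
    ys = sorted(years)
    return [ys[(i * len(ys)) // n_bins] for i in range(1, n_bins)]

# Toy data skewed toward 1800-1900, roughly like our corpus.
years = [1650, 1750] + list(range(1800, 1900)) + [1910, 1920]
print(quantile_boundaries(years, 4))  # -> [1824, 1850, 1876]
```

This addresses the too-few-documents problem directly, at the cost of narrow bins (decades or less) in the dense 1800-1900 region, which matches the concern raised above.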