Open data-steve opened 8 years ago
Petitioners are expected to select AT LEAST ONE and up to three issues out of the 39 issue categories. Does each petition include at least one issue category? If so, then there should be no problem with this. What do you think? --Loni
Hi Loni,
As I mentioned to Yoni over Twitter, it is the structure of the missing data that suggests to me something is odd about the data generation process. After looking at the data structure more, it appears that the data were created in such a way that if there was only one tag applied, it filled in the column issues.id; but if there were multiple tags then it switched over to putting the data into the issues1-issues3 columns, leaving the issues.id column empty.
It would seem to have been easier to just put the data in the issues1-issues3 columns no matter the number of tags. That would suggest my ifelse(is.na(issues1.id), issues.id, issues1.id) righted the odd data generation process. I just wanted to make sure that I was doing justice to the good work of others who made this data available.
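For anyone following along, here is a minimal sketch of that ifelse() fix inside a dplyr pipeline. The toy data frame is made up for illustration and only mimics the column layout described above; it is not taken from petition_analyses, and the name p_fixed is hypothetical.

```r
library(dplyr)

# Toy stand-in for the real data: the values are invented, only the
# column names (issues.id, issues1.id, issues2.id) follow the thread.
p <- data.frame(
  id         = c("a", "b", "c"),
  issues.id  = c(12L, NA, NA),   # filled only when a single tag was applied
  issues1.id = c(NA, 4L, 29L),   # filled when the issues1-issues3 columns were used
  issues2.id = c(NA, 18L, NA)
)

# Use issues1.id when it is present, otherwise fall back to issues.id.
p_fixed <- p %>%
  mutate(issues1.id = ifelse(is.na(issues1.id), issues.id, issues1.id))
```

After this step every toy row has a first tag in issues1.id, whichever column it originally came from. Whether that is the right reading of the real data is exactly the open question in this thread.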
Thanks!
~ Steve
Sent via telepathy
On Apr 9, 2016, at 2:20 PM, lonihagen wrote the comment quoted above.
There are 300+ petitions in the petition_analyses dataset that are missing issues in the issues1-issues3 columns but have them in the issues.id and issues.name columns. Here are the first 5 rows from the data after loading it.
A story I can imagine to explain this behavior in the data is that those 300+ petitions weren't labeled by the petitioners when submitted but were labeled post-hoc by those running the website in order for every petition to have at least one label / tag.
Also, if issues.id were always the first tag filled in, why is it almost always NA? If it were always the first one, then it should always be filled.
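One way to check this across the whole dataset rather than by eye is to count how often each missingness pattern occurs. This is only a sketch: the data frame below is invented toy data with the same column names, and na_pattern, id_missing and id1_missing are hypothetical names chosen for illustration.

```r
library(dplyr)

# Toy stand-in for the real data; values are invented.
p <- data.frame(
  id         = c("a", "b", "c", "d"),
  issues.id  = c(12L, NA, NA, 7L),
  issues1.id = c(NA, 4L, 29L, NA)
)

# One row per combination of (issues.id missing?, issues1.id missing?),
# with n giving the number of petitions showing that combination.
na_pattern <- p %>%
  count(id_missing  = is.na(issues.id),
        id1_missing = is.na(issues1.id))

# TRUE if, in every observed combination, exactly one of the two is NA.
exactly_one_missing <- all(xor(na_pattern$id_missing, na_pattern$id1_missing))
```

If the real data gave exactly_one_missing as TRUE, with issues.id missing in the large majority of rows, that would fit the idea that issues.id is a fallback column rather than the first tag.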
When I run this code to look for NAs in either issues.id or issues1.id:
p %>% select(id, issues1.id, issues.id) %>% filter(is.na(issues.id) | is.na(issues1.id))
the two columns alternate in which one has the NA, as shown in the few examples below: