yoni / r_we_the_people

R package for working with the We The People petition data.

re petition_analyses dataset: rows with missing issues.id and issues.name #3

Open data-steve opened 8 years ago

data-steve commented 8 years ago

There are 300+ petitions in the petition_analyses dataset that have no issues in the issues1-issues3 columns but do have one in the issues.id and issues.name columns. Below are the first 5 rows of the data after loading it.

A story I can imagine to explain this behavior in the data is that those 300+ petitions weren't labeled by the petitioners when submitted but were labeled post-hoc by those running the website in order for every petition to have at least one label / tag.

 p %>% select(id, contains("issues")) %>% slice(1:5)
                        id issues1.id               issues1.name issues2.id      issues2.name issues3.id                   issues3.name issues.id issues.name
1 4e7b352b4bd5046c04000000       <NA>                       <NA>       <NA>              <NA>       <NA>                           <NA>        20 Environment
2 4e7b352b4bd5046c04000000       <NA>                       <NA>       <NA>              <NA>       <NA>                           <NA>        20 Environment
3 4e7b352b4bd5046c04000000       <NA>                       <NA>       <NA>              <NA>       <NA>                           <NA>        20 Environment
4 4e7b35898d8c37d975000000          4 Civil Rights and Liberties        193 Government Reform         28                   Human Rights      <NA>        <NA>
5 4e7b37f611fb9c1179000000         12                    Defense         21            Family        181 Veterans and Military Families      <NA>        <NA>
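
One rough way to count the affected petitions is to filter for rows where all three numbered issue columns are NA while issues.id is filled. The sketch below is only an illustration: `toy` is a small made-up data frame that mirrors the column layout and a few of the ids printed above, and the names `only_unnumbered` and `n_affected` are mine. With the real data, `p` would take the place of `toy`.

```r
library(dplyr)

# Toy stand-in for the dataset (assumption: the real data has these columns)
toy <- data.frame(
  id         = c("4e7b352b4bd5046c04000000", "4e7b352b4bd5046c04000000",
                 "4e7b35898d8c37d975000000", "4e7b37f611fb9c1179000000"),
  issues1.id = c(NA, NA, 4, 12),
  issues2.id = c(NA, NA, 193, 21),
  issues3.id = c(NA, NA, 28, 181),
  issues.id  = c(20, 20, NA, NA)
)

# Rows whose numbered issue columns are all empty but issues.id is filled
only_unnumbered <- toy %>%
  filter(is.na(issues1.id), is.na(issues2.id), is.na(issues3.id),
         !is.na(issues.id))

# Count distinct petitions, because one petition id can span several rows
n_affected <- n_distinct(only_unnumbered$id)
n_affected
```

Counting distinct ids rather than rows matters here because the same petition id repeats across rows in the printed output.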

Also, if issues.id were the first tag and always filled in, why is it almost always NA? If it were always the first one, then it should always be filled.

When I run this code, p %>% select(id, issues1.id, issues.id) %>% filter(is.na(issues.id) | is.na(issues1.id)), to look for NAs in either issues.id or issues1.id, the two columns always alternate in which one has the NA, as shown in the few examples below:

row    id                       issues1.id  issues.id
3278 5126b2e98cce3f2c44000004         25      <NA>
3279 5126b2e98cce3f2c44000004         25      <NA>
3280 5126c27b7043012736000016         12      <NA>
3281 5126c794ee140fca4300001d       <NA>         4
3282 5126c794ee140fca4300001d       <NA>         4
3283 5126f34f688938ce6e000010       <NA>        29
3284 5126f34f688938ce6e000010       <NA>        29
3285 5126f34f688938ce6e000010       <NA>        29
3286 5126f34f688938ce6e000010       <NA>        29
3287 5126f34f688938ce6e000010       <NA>        29
3288 5127826600e579b04000000e          3      <NA>
3289 512787126ce61c8913000017          8      <NA>
3290 512787126ce61c8913000017          8      <NA>
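
The filter above only shows rows where at least one of the two columns is NA, so it cannot rule out rows where both are filled. A cross-tabulation of the missingness pattern over all rows would check that directly. This is a sketch on a made-up `toy` data frame built from a few of the rows listed above; `na_pattern` and `exactly_one` are illustrative names, and the real check would run on `p`.

```r
library(dplyr)

# Toy stand-in built from a few of the rows shown above
toy <- data.frame(
  id         = c("5126b2e98cce3f2c44000004", "5126c27b7043012736000016",
                 "5126c794ee140fca4300001d", "5126f34f688938ce6e000010"),
  issues1.id = c(25, 12, NA, NA),
  issues.id  = c(NA, NA, 4, 29)
)

# Cross-tabulate which of the two columns is missing on each row
na_pattern <- toy %>%
  count(issues1_missing = is.na(issues1.id),
        issues_missing  = is.na(issues.id))
na_pattern

# TRUE when every row has exactly one of the two columns filled
exactly_one <- all(xor(is.na(toy$issues1.id), is.na(toy$issues.id)))
exactly_one
```

If `exactly_one` comes back TRUE on the full data, the two columns never overlap and never both go missing, which would fit the idea that they hold the same information in two places.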

lonihagen commented 8 years ago

Petitioners are expected to select AT LEAST ONE issue, and up to three issues, out of the 39 issue categories. Does each petition include at least one issue category? If so, then there should be no problem with this. What do you think? --Loni
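
The question of whether each petition has at least one issue category could be checked per petition id, looking across all four id columns. A minimal sketch, assuming the column names from the output earlier in this thread; `toy` is made up, and its last row is a purely hypothetical untagged petition added only to show what a failing case would look like.

```r
library(dplyr)

# Toy stand-in; the last row is a hypothetical petition with no issue at all
toy <- data.frame(
  id         = c("4e7b352b4bd5046c04000000", "4e7b35898d8c37d975000000",
                 "hypothetical_untagged"),
  issues1.id = c(NA, 4, NA),
  issues2.id = c(NA, 193, NA),
  issues3.id = c(NA, 28, NA),
  issues.id  = c(20, NA, NA)
)

# For each petition, is any of the four issue id columns filled on any row?
has_issue <- toy %>%
  mutate(any_issue = !is.na(issues1.id) | !is.na(issues2.id) |
                     !is.na(issues3.id) | !is.na(issues.id)) %>%
  group_by(id) %>%
  summarise(any_issue = any(any_issue))

# Petitions without any issue category
untagged <- has_issue %>% filter(!any_issue)
untagged
```

An empty `untagged` on the real data would mean every petition carries at least one category, just not always in the same column.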

data-steve commented 8 years ago

Hi Loni,

As I mentioned to Yoni over Twitter, it is the structure of the missing data that suggests to me something is odd about the data generation process. After looking at the data structure more, it appears that the data were created in such a way that if there was only one tag applied, it filled in the column issues.id; but if there were multiple tags then it switched over to putting the data into the issues1-issues3 columns, leaving the issues.id column empty.

It would seem to have been easier to just put the tags in the issues1-issues3 columns no matter the number of tags. That would suggest my ifelse(is.na(issues1.id), issues.id, issues1.id) righted the odd data generation process. I just wanted to make sure that I was doing justice to the good work of others that made this data available.
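
For reference, here is that fix as a small runnable sketch, next to dplyr's coalesce(), which takes the first non-missing value across its arguments and should give the same result. The `toy` data frame and the new column names are made up for illustration, and I am assuming the id columns are character. If they turn out to be factors, ifelse() returns the underlying integer codes, so converting with as.character() first would be safer.

```r
library(dplyr)

# Toy stand-in with character ids (assumption about the real column type)
toy <- data.frame(
  id         = c("5126b2e98cce3f2c44000004", "5126c794ee140fca4300001d",
                 "512787126ce61c8913000017"),
  issues1.id = c("25", NA, "8"),
  issues.id  = c(NA, "4", NA),
  stringsAsFactors = FALSE
)

fixed <- toy %>%
  mutate(
    # The ifelse() from the comment above: fall back to issues.id when issues1.id is NA
    first_issue_ifelse = ifelse(is.na(issues1.id), issues.id, issues1.id),
    # Same idea with coalesce(): first non-NA of issues1.id, then issues.id
    first_issue = coalesce(issues1.id, issues.id)
  )

# Do the two approaches agree on the toy data?
same_result <- identical(fixed$first_issue_ifelse, fixed$first_issue)
same_result
```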

Thanks!

~ Steve

Sent via telepathy

On Apr 9, 2016, at 2:20 PM, lonihagen <notifications@github.com> wrote:

Petitioners are expected to select AT LEAST ONE issue, and up to three issues, out of the 39 issue categories. Does each petition include at least one issue category? If so, then there should be no problem with this. What do you think? --Loni
