xmarquez / democracyData

Access and manipulate most standard scholarly measures of democracy
https://xmarquez.github.io/democracyData/

Duplicate country-years and missing values in Extended UDS #15

Open arcruz0 opened 1 year ago

arcruz0 commented 1 year ago

Thanks for the great package!

While using democracyData::extended_uds, I've encountered duplicate country-years and unexpected missingness, as shown below.

u <- democracyData::extended_uds

u[duplicated(u[, c("extended_country_name", "year")]),]
#> # A tibble: 3 × 20
#>   extended_country_name     GWn  cown in_GW_system  year     z1  se_z1 z1_pct975
#>   <chr>                   <dbl> <dbl> <lgl>        <dbl>  <dbl>  <dbl>     <dbl>
#> 1 German Federal Republic   260   260 FALSE         1945 -0.895 0.627      0.333
#> 2 German Federal Republic   260   260 TRUE          1990  1.76  0.262      2.27 
#> 3 Yemen (Arab Republic o…   678   679 TRUE          1990  0.163 0.0963     0.351
#> # ℹ 12 more variables: z1_pct025 <dbl>, z1_adj <dbl>, z1_pct975_adj <dbl>,
#> #   z1_pct025_adj <dbl>, z1_as_prob <dbl>, z1_pct975_as_prob <dbl>,
#> #   z1_pct025_as_prob <dbl>, z1_adj_as_prob <dbl>, z1_pct975_adj_as_prob <dbl>,
#> #   z1_pct025_adj_as_prob <dbl>, num_measures <int>, measures <list>

u[is.na(u$extended_country_name),]
#> # A tibble: 83 × 20
#>    extended_country_name   GWn  cown in_GW_system  year    z1 se_z1 z1_pct975
#>    <chr>                 <dbl> <dbl> <lgl>        <dbl> <dbl> <dbl>     <dbl>
#>  1 <NA>                     NA    NA FALSE         1789 -1.14 0.571   -0.0196
#>  2 <NA>                     NA    NA FALSE         1790 -1.14 0.571   -0.0196
#>  3 <NA>                     NA    NA FALSE         1791 -1.14 0.571   -0.0196
#>  4 <NA>                     NA    NA FALSE         1792 -1.14 0.571   -0.0196
#>  5 <NA>                     NA    NA FALSE         1793 -1.14 0.571   -0.0196
#>  6 <NA>                     NA    NA FALSE         1794 -1.14 0.571   -0.0196
#>  7 <NA>                     NA    NA FALSE         1795 -1.14 0.571   -0.0196
#>  8 <NA>                     NA    NA FALSE         1796 -1.14 0.571   -0.0196
#>  9 <NA>                     NA    NA FALSE         1797 -1.14 0.571   -0.0196
#> 10 <NA>                     NA    NA FALSE         1798 -1.14 0.571   -0.0196
#> # ℹ 73 more rows
#> # ℹ 12 more variables: z1_pct025 <dbl>, z1_adj <dbl>, z1_pct975_adj <dbl>,
#> #   z1_pct025_adj <dbl>, z1_as_prob <dbl>, z1_pct975_as_prob <dbl>,
#> #   z1_pct025_as_prob <dbl>, z1_adj_as_prob <dbl>, z1_pct975_adj_as_prob <dbl>,
#> #   z1_pct025_adj_as_prob <dbl>, num_measures <int>, measures <list>

Created on 2023-07-27 with reprex v2.0.2

Please let me know if there is any other information I can provide, and thanks in advance!

xmarquez commented 1 year ago

Hi Andrés,

Thanks for pointing this out - I'll take a good look tomorrow. The issue with East Germany and North Yemen is probably down to code disagreements that end up mattering when I join everything into one dataset: one of those "East Germany" rows should really be unified Germany, and one of those "North Yemen" rows should really be unified Yemen. That particular problem might not be solvable, though - the source datasets simply disagree about when measurement should start and stop.
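To see which rows collide, it can help to print every member of each duplicated country-year rather than only the later occurrence that `duplicated()` returns by default. The following is a minimal sketch on a made-up data frame; the column names mirror `extended_uds`, but the values are invented for illustration and are not taken from the package.

```r
# Toy stand-in for democracyData::extended_uds; all values are made up.
u <- data.frame(
  extended_country_name = c("A", "A", "B", "C", "C"),
  year = c(1990, 1990, 1990, 1945, 1946),
  in_GW_system = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  z1 = c(0.1, 1.7, 0.2, -0.9, -0.8),
  stringsAsFactors = FALSE
)

key <- u[, c("extended_country_name", "year")]

# duplicated() flags only the later occurrence of each repeated key, so
# combine a forward and a backward pass to flag every row in the group.
in_dup_group <- duplicated(key) | duplicated(key, fromLast = TRUE)

dups <- u[in_dup_group, ]
dups <- dups[order(dups$extended_country_name, dups$year), ]
print(dups)
```

Comparing the paired rows side by side (for example their `in_GW_system` and `z1` values) may make it easier to judge which row belongs to which state.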

The NA countries are likely a couple of historical states from V-Dem that I didn't incorporate into the tables for country_year_coder - it might take a bit longer to figure out how to do that without breaking things. Sorry about that! I'll take a proper look tomorrow. In the meantime, you might want to look at how to generate your own version of the extended_uds data to check where the error might be coming from: https://xmarquez.github.io/democracyData/articles/Replicating_and_extending_the_UD_scores.html
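As an interim workaround, rows with no country name can be counted and set aside before analysis. This is only a sketch on a made-up data frame, not a fix in the package itself; the column names follow `extended_uds`, and the values are invented.

```r
# Toy stand-in for democracyData::extended_uds; all values are made up.
u <- data.frame(
  extended_country_name = c(NA, NA, "A", "B"),
  year = c(1789, 1790, 1990, 1990),
  z1 = c(-1.14, -1.14, 0.5, 0.2),
  stringsAsFactors = FALSE
)

# Flag rows whose country name is missing.
unmatched <- is.na(u$extended_country_name)
cat("Rows without a country name:", sum(unmatched), "\n")

# Keep only the rows that carry a country name.
u_named <- u[!unmatched, ]
print(u_named)
```

Dropping these rows discards their scores, so this is only reasonable if the unnamed states are not needed for the analysis at hand.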