xKDR / Survey.jl

Analysis of complex surveys
https://xkdr.github.io/Survey.jl/
GNU General Public License v3.0
53 stars 19 forks source link

[WIP] add jackknife for unstratified and stratified #260

Closed smishr closed 1 year ago

smishr commented 1 year ago
codecov-commenter commented 1 year ago

Codecov Report

Merging #260 (1da3e8f) into v0.1.1 (e9362f9) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            v0.1.1      #260   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        13    +1     
  Lines          200       243   +43     
=========================================
+ Hits           200       243   +43     
Impacted Files Coverage Δ
src/jackknife.jl 100.00% <100.00%> (ø)
src/mean.jl 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

smishr commented 1 year ago

@ayushpatnaikgit can you look into my new simple algorithm for calculating jackknife weights columns? at line 63 of jackknife.jl.

Im having trouble finding indices of a given DataFrame which satisfy some conditions. If you can see it would be awesome.

codetalker7 commented 1 year ago

Can link this PR to #231?

codetalker7 commented 1 year ago

Also @smishr, what is the difference between the functions jk1weights and jackknifeweights? They seem to have the same signature.

smishr commented 1 year ago

Also @smishr, what is the difference between the functions jk1weights and jackknifeweights? They seem to have the same signature.

jk1 is the case when there is a single/no strata. simplifies algorithm and makes it faster by having it as separate.

the bigger function is when there are multiple strata (hence loop over them for each cluster)

codetalker7 commented 1 year ago

@smishr @ayushpatnaikgit I've added an implementation of jackknifeweights (and I've commented out everything else; once this is cleared will remove all of those commented lines). For example, we can now do this.

using Survey
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
jackknife_design = jackknifeweights(dstrat)

Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?

ayushpatnaikgit commented 1 year ago

@smishr @ayushpatnaikgit I've added an implementation of jackknifeweights (and I've commented out everything else; once this is cleared will remove all of those commented lines). For example, we can now do this.

using Survey
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
jackknife_design = jackknifeweights(dstrat)

Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?

You can test this against R. Use repdesign to make replicate weights using jkn. Check if the replicate weights generated by R and Julia are the same.

https://r-survey.r-forge.r-project.org/survey/html/as.svrepdesign.html

codetalker7 commented 1 year ago

@smishr @ayushpatnaikgit I've added an implementation of jackknifeweights (and I've commented out everything else; once this is cleared will remove all of those commented lines). For example, we can now do this.

using Survey
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
jackknife_design = jackknifeweights(dstrat)

Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?

You can test this against R. Use repdesign to make replicate weights using jkn. Check if the replicate weights generated by R and Julia are the same.

https://r-survey.r-forge.r-project.org/survey/html/as.svrepdesign.html

Are we going to include these tests in our test suite? Like using RCall?

smishr commented 1 year ago

Are we going to include these tests in our test suite? Like using RCall?

is it common to use rcall in Julia testing suite? i mean we effectively do it anyways. @ayushpatnaikgit

codetalker7 commented 1 year ago

Hi @smishr @ayushpatnaikgit I was trying to compare R's output to that of Julia. For example consider this Julia code:

using Survey
apiclus1 = load_data("apiclus1")
dclus1 = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw)
rclus1 = jackknifeweights(dclus1)
rclus1.data[!, :replicate_1]    # get the first replicate weights column

And the corresponding version of the R code.

library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1)
rclus1<-as.svrepdesign(dclus1, type="JK1", compress=FALSE)    # jackknife
rclus1$repweights[, 1]  # get first column of replicates

I'm getting different outputs for this; in both the output vectors, the first 11 entries are 0s. In Julia's version, all other entries are 36.2646 and in R, all other entries are 1.071. I also noticed that R's dclus1 object doesn't have any weights, instead it has probabilities. Is this the reason for the difference? Or am I getting something wrong here?

ayushpatnaikgit commented 1 year ago

Can you implement the variance formula and compare the variance that you get in Julia vs R?

We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

codetalker7 commented 1 year ago

Can you implement the variance formula and compare the variance that you get in Julia vs R?

We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you implement the variance formula and compare the variance that you get in Julia vs R?

We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you remind me what the variance formula is (if possible any references)?

Also, I tried debugging R's as.svrepdesign function to see how they're calculating the replicate weights (for the example code I mentioned above). They are calculating the replicate weights using this formula (see line 62 of https://github.com/cran/survey/blob/master/R/surveyrep.R):

repweights<-outer(psu, unq, "!=")*n/(n-1)

For reference, here psu is the vector of clusters, unq is the vector of unique clusters, and n is the number of unique clusters. The outer function calculates the outer product of two arrays.

I'm not able to make sense of this.

smishr commented 1 year ago

Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?

Great work! Im looking into it, it seems to working great.

Jackknife algorithm gives a deterministic variance for summary statistics (like mean or total). That means the Julia answer must match with R or other package answer.

Can you remind me what the variance formula is (if possible any references)?

See Eq 9.9 in Lohr Sampling Analysis book on how to combine the JK replicate columns into a variance for a given statistic.

I dont think it is possible to test the implementation structurally, since replicate columns may be stored in arbitrary order.

ayushpatnaikgit commented 1 year ago

Can you implement the variance formula and compare the variance that you get in Julia vs R? We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you implement the variance formula and compare the variance that you get in Julia vs R? We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you remind me what the variance formula is (if possible any references)?

Also, I tried debugging R's as.svrepdesign function to see how they're calculating the replicate weights (for the example code I mentioned above). They are calculating the replicate weights using this formula (see line 62 of https://github.com/cran/survey/blob/master/R/surveyrep.R):

repweights<-outer(psu, unq, "!=")*n/(n-1)

For reference, here psu is the vector of clusters, unq is the vector of unique clusters, and n is the number of unique clusters. The outer function calculates the outer product of two arrays.

I'm not able to make sense of this.

Suppose you have an estimator, such as total. This accepts two arguments, values and weights and it returns a number. Instead of weights, you pass all the replicate weights, and you get a vector of numbers.

There is a formula in the Rao-Wu paper and the Lohr book to go from the formula to from those vector of numbers to the variance of total. In the case of bootstrap, this is just variance of the vector of numbers. In the case of JKn, it's a slightly different.

Don't try to read R code, it'll be very difficult. Just compare the answers.

smishr commented 1 year ago

@ayushpatnaikgit i tried integrating jackknifeweights with mean. It is absolutely horrendous and slow obviously, but I think it might functionally work. Can be simplified and sped up obviously. Basically trying to implement Eq 9.9 from Lohr pg 382.

codetalker7 commented 1 year ago

Can you implement the variance formula and compare the variance that you get in Julia vs R? We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you implement the variance formula and compare the variance that you get in Julia vs R? We we are doing is: keep weight unchanged if strara != selected_strata. While R might be doing: make weight = 0 if strata != seleccted_strata

Can you remind me what the variance formula is (if possible any references)? Also, I tried debugging R's as.svrepdesign function to see how they're calculating the replicate weights (for the example code I mentioned above). They are calculating the replicate weights using this formula (see line 62 of https://github.com/cran/survey/blob/master/R/surveyrep.R):

repweights<-outer(psu, unq, "!=")*n/(n-1)

For reference, here psu is the vector of clusters, unq is the vector of unique clusters, and n is the number of unique clusters. The outer function calculates the outer product of two arrays. I'm not able to make sense of this.

Suppose you have an estimator, such as total. This accepts two arguments, values and weights and it returns a number. Instead of weights, you pass all the replicate weights, and you get a vector of numbers.

There is a formula in the Rao-Wu paper and the Lohr book to go from the formula to from those vector of numbers to the variance of total. In the case of bootstrap, this is just variance of the vector of numbers. In the case of JKn, it's a slightly different.

Don't try to read R code, it'll be very difficult. Just compare the answers.

Tried to use @smishr's implementation of mean to estimate the standard error for the column :api00. Here is the Julia output.

julia> mean(:api00, rclus1)
1×2 DataFrame
 Row │ mean     SE      
     │ Float64  Float64 
─────┼──────────────────
   1 │ 644.169      0.0

And here is the output from R.

> svymean(~api00, rclus1)
        mean     SE
api00 644.17 26.594
codetalker7 commented 1 year ago

@smishr @ayushpatnaikgit I've added an implementation of jackknnife_variance and also shortened code for jackknifeweights a bit (the code for computing nh values for each stratum was repetitive). Also, the variance value is now (nearly) matching with R. To put everything together, here is a complete example in Julia.

julia> using Survey, StatsBase;
julia> apiclus1 = load_data("apiclus1");
julia> dclus1 = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw);
julia> rclus1 = jackknifeweights(dclus1);
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, rclus1)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   644.169  26.5997

And the same example in R.

> library(survey);
> data(api);
> dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1);
> rclus1<-as.svrepdesign(dclus1, type="JK1", compress=FALSE);   # jackknife
> svymean(~api00, rclus1)
        mean     SE
api00 644.17 26.594
smishr commented 1 year ago

@smishr @ayushpatnaikgit I've added an implementation of jackknnife_variance and also shortened code for jackknifeweights a bit (the code for computing nh values for each stratum was repetitive). Also, the variance value is now (nearly) matching with R. To put everything together, here is a complete example in Julia.

julia> using Survey, StatsBase;
julia> apiclus1 = load_data("apiclus1");
julia> dclus1 = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw);
julia> rclus1 = jackknifeweights(dclus1);
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, rclus1)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   644.169  26.5997

And the same example in R.

> library(survey);
> data(api);
> dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1);
> rclus1<-as.svrepdesign(dclus1, type="JK1", compress=FALSE);   # jackknife
> svymean(~api00, rclus1)
        mean     SE
api00 644.17 26.594

thanks so much siddhant well have look

ayushpatnaikgit commented 1 year ago

Can you try this:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_surveyphreg_details31.htm

ayushpatnaikgit commented 1 year ago

Can you try this:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_surveyphreg_details31.htm

Actually, I don't think this'll make a difference.

smishr commented 1 year ago

ive cleaned up and started adding tests. The algorithm is correct with R for srs and strat data. Am checking with clusters and nhanes

codetalker7 commented 1 year ago

Added docstrings. Now only the remaining tests have to be added.

smishr commented 1 year ago

@codetalker7 @ayushpatnaikgit i added tests for clus2 and nhanes.

Apart from the issue that jackknife variance will only work for mean and total (not ratio), is there anything else to do in this PR before merge?

v0.1.1 is starting to diverge heavily from main, we should do new release asap after this PR for clean slate

smishr commented 1 year ago

@ayushpatnaikgit @codetalker7 windows latest failing for some weird reason?

codetalker7 commented 1 year ago

@ayushpatnaikgit @codetalker7 windows latest failing for some weird reason?

Reran the failed run and it is now passing.