Closed smishr closed 1 year ago
Merging #260 (1da3e8f) into v0.1.1 (e9362f9) will not change coverage. The diff coverage is 100.00%.
@@            Coverage Diff            @@
##            v0.1.1     #260   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           12       13     +1
  Lines          200      243    +43
=========================================
+ Hits           200      243    +43

| Impacted Files | Coverage Δ |
|---|---|
| src/jackknife.jl | 100.00% <100.00%> (ø) |
| src/mean.jl | 100.00% <100.00%> (ø) |
@ayushpatnaikgit can you look into my new simple algorithm for calculating the jackknife weight columns, at line 63 of jackknife.jl?
I'm having trouble finding the indices of a given DataFrame which satisfy some conditions. If you can see a way, it would be awesome.
Can you link this PR to #231?
Also @smishr, what is the difference between the functions `jk1weights` and `jackknifeweights`? They seem to have the same signature.
`jk1weights` is for the case when there is a single stratum (or no strata); keeping it separate simplifies the algorithm and makes it faster.
The bigger function is for when there are multiple strata (hence the loop over them for each cluster).
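As a rough illustration of that distinction (a toy sketch with made-up vectors, not the package implementation), the two weight multipliers look like this:

```julia
# Toy sketch of the two jackknife cases (not the package code).
# JK1 (single/no stratum): deleting PSU j zeroes its weights and
# rescales every other PSU's weights by n/(n-1).
psu    = [1, 1, 2, 2, 3, 3]   # cluster (PSU) id per observation
strata = [1, 1, 1, 1, 2, 2]   # stratum id per observation
w      = fill(10.0, 6)        # base sampling weights
n      = length(unique(psu))  # total number of PSUs

jk1_replicate(j) = [psu[i] == j ? 0.0 : w[i] * n / (n - 1) for i in eachindex(w)]

# JKn (multiple strata): deleting PSU j only rescales weights inside
# that PSU's stratum, by nh/(nh-1); other strata stay unchanged.
function jkn_replicate(j)
    h  = strata[findfirst(==(j), psu)]       # stratum of the deleted PSU
    nh = length(unique(psu[strata .== h]))   # number of PSUs in that stratum
    [strata[i] != h ? w[i] :
     psu[i] == j   ? 0.0  : w[i] * nh / (nh - 1) for i in eachindex(w)]
end
```

The single-stratum version needs no per-stratum bookkeeping, which is presumably where the speedup comes from.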
@smishr @ayushpatnaikgit I've added an implementation of `jackknifeweights` (and I've commented out everything else; once this is cleared I will remove all of those commented lines). For example, we can now do this:
using Survey
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
jackknife_design = jackknifeweights(dstrat)
Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?
You can test this against R. Use `as.svrepdesign` to make replicate weights using JKn. Check if the replicate weights generated by R and Julia are the same.
https://r-survey.r-forge.r-project.org/survey/html/as.svrepdesign.html
Are we going to include these tests in our test suite? Like using `RCall`?
Is it common to use RCall in a Julia testing suite? I mean, we effectively do it anyway. @ayushpatnaikgit
Hi @smishr @ayushpatnaikgit I was trying to compare R's output to that of Julia. For example consider this Julia code:
using Survey
apiclus1 = load_data("apiclus1")
dclus1 = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw)
rclus1 = jackknifeweights(dclus1)
rclus1.data[!, :replicate_1] # get the first replicate weights column
And the corresponding version of the R code.
library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1)
rclus1<-as.svrepdesign(dclus1, type="JK1", compress=FALSE) # jackknife
rclus1$repweights[, 1] # get first column of replicates
I'm getting different outputs: in both output vectors the first 11 entries are 0s, but in Julia's version all the other entries are 36.2646, while in R they are 1.071. I also noticed that R's `dclus1` object doesn't have any weights; instead it has probabilities. Is this the reason for the difference? Or am I getting something wrong here?
Can you implement the variance formula and compare the variance that you get in Julia vs R?
What we are doing is: keep the weight unchanged if `strata != selected_strata`. While R might be doing: make the weight 0 if `strata != selected_strata`.
Can you remind me what the variance formula is (if possible any references)?
Also, I tried debugging R's `as.svrepdesign` function to see how they're calculating the replicate weights (for the example code I mentioned above). They are calculating the replicate weights using this formula (see line 62 of https://github.com/cran/survey/blob/master/R/surveyrep.R):
repweights<-outer(psu, unq, "!=")*n/(n-1)
For reference, here `psu` is the vector of clusters, `unq` is the vector of unique clusters, and `n` is the number of unique clusters. The `outer` function computes a generalized outer product, applying `"!="` to every pair of elements.
I'm not able to make sense of this.
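For what it's worth, that R line can be reproduced in a few lines of Julia (a toy reconstruction with made-up data; the replicate column order may differ from R's):

```julia
# Toy reconstruction of R's: repweights <- outer(psu, unq, "!=") * n/(n-1)
# Entry [i, j] is the weight multiplier for observation i in replicate j:
# 0 when observation i belongs to the deleted PSU unq[j], n/(n-1) otherwise.
psu = [1, 1, 2, 3]    # cluster id per observation (made up)
unq = unique(psu)     # one replicate per unique cluster
n   = length(unq)
repweights = [p != u for p in psu, u in unq] .* (n / (n - 1))
```

So each column is one jackknife replicate: the deleted cluster's rows get multiplier 0 and everything else gets n/(n-1), which matches the JK1 description above.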
Can you please have a look? I'll soon add documentation. Also, question: what is a good way to test this?
Great work! I'm looking into it; it seems to be working great.
The jackknife algorithm gives a deterministic variance for summary statistics (like the mean or total). That means the Julia answer must match the answer from R or any other package.
Can you remind me what the variance formula is (if possible any references)?
See Eq. 9.9 in Lohr's Sampling: Design and Analysis on how to combine the JK replicate columns into a variance for a given statistic.
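For reference, in the single-stratum (JK1) case that formula reduces to v̂(θ̂) = ((n−1)/n) Σⱼ (θ̂₍ⱼ₎ − θ̂)², where θ̂₍ⱼ₎ is the estimate computed with the j-th replicate-weight column. A minimal sketch with made-up numbers:

```julia
# Combining JK1 replicate estimates into a variance (single-stratum
# special case of Lohr's Eq. 9.9). All numbers here are hypothetical.
theta_hat = 644.0                  # full-sample estimate
theta_j   = [640.0, 648.0, 645.0]  # estimates from each replicate column
n  = length(theta_j)               # one replicate per PSU
v  = (n - 1) / n * sum((theta_j .- theta_hat) .^ 2)
se = sqrt(v)
```

The stratified (JKn) version sums a ((nh−1)/nh)-scaled term per stratum instead of using a single global factor.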
I don't think it is possible to test the implementation structurally, since the replicate columns may be stored in arbitrary order.
Suppose you have an estimator, such as `total`. It accepts two arguments, `values` and `weights`, and returns a number. Instead of `weights`, you pass each of the replicate weight columns in turn, and you get a vector of numbers.
There is a formula in the Rao-Wu paper and the Lohr book to go from that vector of numbers to the variance of `total`. In the case of the bootstrap, this is just the variance of the vector of numbers. In the case of JKn, it's slightly different.
Don't try to read the R code, it'll be very difficult. Just compare the answers.
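The first step described above can be sketched like this (toy data and a hand-rolled `total`, purely illustrative):

```julia
# Apply an estimator to each replicate-weight column to get the
# vector of replicate estimates. Data and multipliers are made up.
total(values, weights) = sum(values .* weights)

y = [2.0, 3.0, 5.0]
# One replicate-weight column per PSU (JK1-style multipliers).
repweights = [0.0 1.5 1.5;
              1.5 0.0 1.5;
              1.5 1.5 0.0]
theta_j = [total(y, repweights[:, j]) for j in 1:size(repweights, 2)]
```

`theta_j` is then fed into the jackknife (or bootstrap) variance formula.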
@ayushpatnaikgit I tried integrating `jackknifeweights` with `mean`. It is horrendous and slow, obviously, but I think it might functionally work; it can be simplified and sped up. Basically I'm trying to implement Eq. 9.9 from Lohr, pg. 382.
Tried to use @smishr's implementation of `mean` to estimate the standard error for the column `:api00`. Here is the Julia output.
julia> mean(:api00, rclus1)
1×2 DataFrame
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 644.169 0.0
And here is the output from R.
> svymean(~api00, rclus1)
mean SE
api00 644.17 26.594
@smishr @ayushpatnaikgit I've added an implementation of `jackknife_variance` and also shortened the code for `jackknifeweights` a bit (the code for computing the `nh` values for each stratum was repetitive). Also, the variance value is now (nearly) matching R's. To put everything together, here is a complete example in Julia.
julia> using Survey, StatsBase;
julia> apiclus1 = load_data("apiclus1");
julia> dclus1 = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw);
julia> rclus1 = jackknifeweights(dclus1);
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, rclus1)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 644.169 26.5997
And the same example in R.
> library(survey);
> data(api);
> dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1);
> rclus1<-as.svrepdesign(dclus1, type="JK1", compress=FALSE); # jackknife
> svymean(~api00, rclus1)
mean SE
api00 644.17 26.594
Thanks so much Siddhant, we'll have a look.
Can you try this:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_surveyphreg_details31.htm
Actually, I don't think this'll make a difference.
I've cleaned up and started adding tests. The algorithm agrees with R for `srs` and `strat` data. Am checking with clusters and `nhanes`.
Added docstrings. Now only the remaining tests have to be added.
@codetalker7 @ayushpatnaikgit I added tests for clus2 and nhanes.
Apart from the issue that jackknife variance will only work for mean and total (not ratio), is there anything else to do in this PR before merging?
v0.1.1 is starting to diverge heavily from main; we should do a new release ASAP after this PR for a clean slate.
@ayushpatnaikgit @codetalker7 Windows latest is failing for some weird reason?
Reran the failed run and it is now passing.