xKDR / Survey.jl

Analysis of complex surveys
https://xkdr.github.io/Survey.jl/
GNU General Public License v3.0
50 stars 19 forks source link

Jackknife estimates and variance not matching for dclus2 #288

Closed smishr closed 1 year ago

smishr commented 1 year ago

Tests for jackknife variance are matching analogous R code for srs and strat designs.

But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.

# clus2 with fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.099
api99 645.03 29.711
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 33.992
api99 645.03 33.717

# clus2 without fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.712
api99 645.03 30.308
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 34.928
api99 645.03 34.645

I even tried combinations by removing fpc2, snum, weights and without weights.

In julia i tried

dclus2_jk = dclus2 |> jackknifeweights
mean([:api00,:api99],dclus2_jk)
2×3 DataFrame
 Row │ names   estimator  SE      
     │ String  Float64    Float64 
─────┼────────────────────────────
   1 │ api00     703.81   23.4023
   2 │ api99     677.444  23.7026

Will try with other cluster samples as well. Not sure if it is dclus2 specific or for all cluster samples

ayushpatnaikgit commented 1 year ago

703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.

@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?

PS, I wouldn't label it high priority since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.

codetalker7 commented 1 year ago

Hi @smishr @ayushpatnaikgit can you please provide the code for making the dclus2 design? It's not available in the documentation.

smishr commented 1 year ago

Hi @smishr @ayushpatnaikgit can you please provide the code for making the dclus2 design? It's not available in the documentation.

See runtests.jl, @itsdebartha had added apiclus2 code.

codetalker7 commented 1 year ago

703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.

@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?

PS, I wouldn't label it high priority since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.

I'm able to reproduce this error. I also explicitly defined the weightedmean function.

julia> using Survey, StatsBase;
julia> apiclus2 = load_data("apiclus2");
julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw);
julia> dclus2_jk = dclus2 |> jackknifeweights;
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │    703.81  23.4023

julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   677.444  23.7026 

Tests for jackknife variance are matching analogous R code for srs and strat designs.

But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.

# clus2 with fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.099
api99 645.03 29.711
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 33.992
api99 645.03 33.717

# clus2 without fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.712
api99 645.03 30.308
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 34.928
api99 645.03 34.645

I even tried combinations by removing fpc2, snum, weights and without weights.

In julia i tried

dclus2_jk = dclus2 |> jackknifeweights
mean([:api00,:api99],dclus2_jk)
2×3 DataFrame
 Row │ names   estimator  SE      
     │ String  Float64    Float64 
─────┼────────────────────────────
   1 │ api00     703.81   23.4023
   2 │ api99     677.444  23.7026

Will try with other cluster samples as well. Not sure if it is dclus2 specific or for all cluster samples

Also @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing ~dnum + snum instead of just dnum. Could this be the reason for the difference?

smishr commented 1 year ago

703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.

@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?

PS, I wouldn't label it high priority since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.

I'm able to reproduce this error. I also explicitly defined the weightedmean function.

julia> using Survey, StatsBase;
julia> apiclus2 = load_data("apiclus2");
julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw);
julia> dclus2_jk = dclus2 |> jackknifeweights;
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │    703.81  23.4023

julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   677.444  23.7026 

Tests for jackknife variance are matching analogous R code for srs and strat designs.

But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.

# clus2 with fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.099
api99 645.03 29.711
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 33.992
api99 645.03 33.717

# clus2 without fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.712
api99 645.03 30.308
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 34.928
api99 645.03 34.645

I even tried combinations by removing fpc2, snum, weights and without weights.

In julia i tried

dclus2_jk = dclus2 |> jackknifeweights
mean([:api00,:api99],dclus2_jk)
2×3 DataFrame
 Row │ names   estimator  SE      
     │ String  Float64    Float64 
─────┼────────────────────────────
   1 │ api00     703.81   23.4023
   2 │ api99     677.444  23.7026

Will try with other cluster samples as well. Not sure if it is dclus2 specific or for all cluster samples

Also @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing ~dnum + snum instead of just dnum. Could this be the reason for the difference?

yes its possible that R doing different calc to the jacknife we have implemented in Julia. But since the same algorithm matches estimates+variance for srs and strat, i dont know where the divergence from when clusters are present.

@codetalker7 you can also replicate above for apiclus1, see if same issue. You showed some minor differences on our Monday call? i am more confident that current Julia replicatedesigns works better for apiclus1 than 2

smishr commented 1 year ago

703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.

@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?

PS, I wouldn't label it high priority since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.

I'm able to reproduce this error. I also explicitly defined the weightedmean function.

julia> using Survey, StatsBase;
julia> apiclus2 = load_data("apiclus2");
julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw);
julia> dclus2_jk = dclus2 |> jackknifeweights;
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │    703.81  23.4023

julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   677.444  23.7026 

Tests for jackknife variance are matching analogous R code for srs and strat designs.

But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.

# clus2 with fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.099
api99 645.03 29.711
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 33.992
api99 645.03 33.717

# clus2 without fpc (doesnt match Julia)
dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2)
dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE)
> svymean(~api00+api99,dclus2)
        mean     SE
api00 670.81 30.712
api99 645.03 30.308
> svymean(~api00+api99,dclus2_jk)
        mean     SE
api00 670.81 34.928
api99 645.03 34.645

I even tried combinations by removing fpc2, snum, weights and without weights.

In julia i tried

dclus2_jk = dclus2 |> jackknifeweights
mean([:api00,:api99],dclus2_jk)
2×3 DataFrame
 Row │ names   estimator  SE      
     │ String  Float64    Float64 
─────┼────────────────────────────
   1 │ api00     703.81   23.4023
   2 │ api99     677.444  23.7026

Will try with other cluster samples as well. Not sure if it is dclus2 specific or for all cluster samples

Also @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing ~dnum + snum instead of just dnum. Could this be the reason for the difference?

yes its possible that R doing different calc to the jacknife we have implemented in Julia. But since the same algorithm matches estimates+variance for srs and strat, i dont know where the divergence from adding clusters comes from.

@codetalker7 you can also replicate above for apiclus1, see if similar issue. You showed some minor differences on our Monday call? i am more confident that current Julia replicatedesigns works better for apiclus1 than 2

ayushpatnaikgit commented 1 year ago

For the point estimate. R isn't doing any magic. It's just weighted mean, instead of mean.

If all the weights are the same in srs and strat, it won't make a difference. Somewhere we are making a mistake and doing a regular mean.

ayushpatnaikgit commented 1 year ago

@smishr did you do: Equivalent of:

apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));

In R?

Can you share you full code in R and Julia? From top to bottom.

ayushpatnaikgit commented 1 year ago

@smishr did you do: Equivalent of:

apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));

In R?

Can you share you full code in R and Julia? From top to bottom.

" The sampling weights in apiclus1 are incorrect (the weight should be 757/15) but are as obtained from UCLA. " https://r-survey.r-forge.r-project.org/survey/html/api.html

Not sure if it's the case with apiclus2 as well, but we are doing it in Julia. What to make sure that we are doing the same thing in both R and Julia.

smishr commented 1 year ago

@smishr did you do: Equivalent of:

apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));

In R?

Can you share you full code in R and Julia? From top to bottom.

this is only issue for apiclus1 not 2 afaik. Can we double check the SurveyDesign that is contructed in Julia with R? eyeball any differences.

codetalker7 commented 1 year ago

Hi @ayushpatnaikgit @smishr I tried the same thing without this line:

apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));

After this, the point estimates are matching. The variance however is matching up to only one decimal point.

julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   670.812  34.9388

julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   645.034  34.6565
ayushpatnaikgit commented 1 year ago

Hi @ayushpatnaikgit @smishr I tried the same thing without this line:

apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));

After this, the point estimates are matching. The variance however is matching up to only one decimal point.

julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   670.812  34.9388

julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
 Row │ estimator  SE      
     │ Float64    Float64 
─────┼────────────────────
   1 │   645.034  34.6565

Great, that's what I expected.