Closed smishr closed 1 year ago
703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.
@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?
PS, I wouldn't label it high priority
since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.
Hi @smishr @ayushpatnaikgit can you please provide the code for making the dclus2
design? It's not available in the documentation.
Hi @smishr @ayushpatnaikgit can you please provide the code for making the
dclus2
design? It's not available in the documentation.
See runtests.jl, @itsdebartha had added apiclus2 code.
703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.
@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?
PS, I wouldn't label it
high priority
since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.
I'm able to reproduce this error. I also explicitly defined the weightedmean
function.
julia> using Survey, StatsBase;
julia> apiclus2 = load_data("apiclus2");
julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw);
julia> dclus2_jk = dclus2 |> jackknifeweights;
julia> weightedmean(x, y) = mean(x, weights(y));
julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 703.81 23.4023
julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 677.444 23.7026
Tests for jackknife variance are matching analogous R code for srs and strat designs.
But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.
# clus2 with fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.099 api99 645.03 29.711 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 33.992 api99 645.03 33.717 # clus2 without fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.712 api99 645.03 30.308 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 34.928 api99 645.03 34.645
I even tried combinations by removing fpc2, snum, weights and without weights.
In julia i tried
dclus2_jk = dclus2 |> jackknifeweights mean([:api00,:api99],dclus2_jk) 2×3 DataFrame Row │ names estimator SE │ String Float64 Float64 ─────┼──────────────────────────── 1 │ api00 703.81 23.4023 2 │ api99 677.444 23.7026
Will try with other cluster samples as well. Not sure if it is
dclus2
specific or for all cluster samples
Also @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing ~dnum + snum
instead of just dnum
. Could this be the reason for the difference?
703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.
@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?
PS, I wouldn't label it
high priority
since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.I'm able to reproduce this error. I also explicitly defined the
weightedmean
function.julia> using Survey, StatsBase; julia> apiclus2 = load_data("apiclus2"); julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),)); julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw); julia> dclus2_jk = dclus2 |> jackknifeweights; julia> weightedmean(x, y) = mean(x, weights(y)); julia> jackknife_variance(:api00, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 703.81 23.4023 julia> jackknife_variance(:api99, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 677.444 23.7026
Tests for jackknife variance are matching analogous R code for srs and strat designs.
But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.
# clus2 with fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.099 api99 645.03 29.711 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 33.992 api99 645.03 33.717 # clus2 without fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.712 api99 645.03 30.308 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 34.928 api99 645.03 34.645
I even tried combinations by removing fpc2, snum, weights and without weights.
In julia i tried
dclus2_jk = dclus2 |> jackknifeweights mean([:api00,:api99],dclus2_jk) 2×3 DataFrame Row │ names estimator SE │ String Float64 Float64 ─────┼──────────────────────────── 1 │ api00 703.81 23.4023 2 │ api99 677.444 23.7026
Will try with other cluster samples as well. Not sure if it is
dclus2
specific or for all cluster samplesAlso @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing
~dnum + snum
instead of justdnum
. Could this be the reason for the difference?
yes its possible that R doing different calc to the jacknife we have implemented in Julia. But since the same algorithm matches estimates+variance for srs and strat, i dont know where the divergence from when clusters are present.
@codetalker7 you can also replicate above for apiclus1, see if same issue. You showed some minor differences on our Monday call? i am more confident that current Julia replicatedesigns works better for apiclus1 than 2
703.81 and 677.444 are the regular means instead of weighted mean. It seems like a small bug.
@codetalker7 can you run Shikhar's code and verify if you are getting the same bug?
PS, I wouldn't label it
high priority
since jackknife isn't merged into the main branch, if everything is high priority, then nothing is.I'm able to reproduce this error. I also explicitly defined the
weightedmean
function.julia> using Survey, StatsBase; julia> apiclus2 = load_data("apiclus2"); julia> apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),)); julia> dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw); julia> dclus2_jk = dclus2 |> jackknifeweights; julia> weightedmean(x, y) = mean(x, weights(y)); julia> jackknife_variance(:api00, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 703.81 23.4023 julia> jackknife_variance(:api99, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 677.444 23.7026
Tests for jackknife variance are matching analogous R code for srs and strat designs.
But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.
# clus2 with fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.099 api99 645.03 29.711 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 33.992 api99 645.03 33.717 # clus2 without fpc (doesnt match Julia) dclus2 <- svydesign(id=~dnum+snum, weights = ~pw, data=apiclus2) dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) > svymean(~api00+api99,dclus2) mean SE api00 670.81 30.712 api99 645.03 30.308 > svymean(~api00+api99,dclus2_jk) mean SE api00 670.81 34.928 api99 645.03 34.645
I even tried combinations by removing fpc2, snum, weights and without weights.
In julia i tried
dclus2_jk = dclus2 |> jackknifeweights mean([:api00,:api99],dclus2_jk) 2×3 DataFrame Row │ names estimator SE │ String Float64 Float64 ─────┼──────────────────────────── 1 │ api00 703.81 23.4023 2 │ api99 677.444 23.7026
Will try with other cluster samples as well. Not sure if it is
dclus2
specific or for all cluster samplesAlso @smishr, isn't the R code doing something slightly different than the Julia code? For example in the cluster id argument we are passing
~dnum + snum
instead of justdnum
. Could this be the reason for the difference?yes its possible that R doing different calc to the jacknife we have implemented in Julia. But since the same algorithm matches estimates+variance for srs and strat, i dont know where the divergence from adding clusters comes from.
@codetalker7 you can also replicate above for apiclus1, see if similar issue. You showed some minor differences on our Monday call? i am more confident that current Julia replicatedesigns works better for apiclus1 than 2
For the point estimate. R isn't doing any magic. It's just weighted mean, instead of mean.
If all the weights are the same in srs and strat, it won't make a difference. Somewhere we are making a mistake and doing a regular mean.
@smishr did you do: Equivalent of:
apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
In R?
Can you share you full code in R and Julia? From top to bottom.
@smishr did you do: Equivalent of:
apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
In R?
Can you share you full code in R and Julia? From top to bottom.
" The sampling weights in apiclus1 are incorrect (the weight should be 757/15) but are as obtained from UCLA. " https://r-survey.r-forge.r-project.org/survey/html/api.html
Not sure if it's the case with apiclus2
as well, but we are doing it in Julia. What to make sure that we are doing the same thing in both R and Julia.
@smishr did you do: Equivalent of:
apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
In R?
Can you share you full code in R and Julia? From top to bottom.
this is only issue for apiclus1 not 2 afaik. Can we double check the SurveyDesign that is contructed in Julia with R? eyeball any differences.
Hi @ayushpatnaikgit @smishr I tried the same thing without this line:
apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
After this, the point estimates are matching. The variance however is matching up to only one decimal point.
julia> jackknife_variance(:api00, weightedmean, dclus2_jk)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 670.812 34.9388
julia> jackknife_variance(:api99, weightedmean, dclus2_jk)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 645.034 34.6565
Hi @ayushpatnaikgit @smishr I tried the same thing without this line:
apiclus2[!, :pw] = fill(757 / 15, (size(apiclus2, 1),));
After this, the point estimates are matching. The variance however is matching up to only one decimal point.
julia> jackknife_variance(:api00, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 670.812 34.9388 julia> jackknife_variance(:api99, weightedmean, dclus2_jk) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 645.034 34.6565
Great, that's what I expected.
Tests for jackknife variance are matching analogous R code for srs and strat designs.
But i tried two-stage cluster sample, and have been getting different point estimates, not to mention variance.
I even tried combinations by removing fpc2, snum, weights and without weights.
In julia i tried
Will try with other cluster samples as well. Not sure if it is
dclus2
specific or for all cluster samples