xKDR / Survey.jl

Analysis of complex surveys
https://xkdr.github.io/Survey.jl/
GNU General Public License v3.0

Testing using other survey datasets #72

Open smishr opened 1 year ago

smishr commented 1 year ago

So far, most of the test suite is limited to the API dataset. I suggest improving the tests by using other publicly available survey datasets. The examples from Lumley's R survey textbook could be used (p. 7, section 1.2.1), e.g. NHANES, SHS, SIPP.

http://asdfree.com - Analyse Surveys Free has many real-world datasets and examples, with the corresponding R survey code.

smishr commented 1 year ago

@iuliadmtru We think (a smaller and older version of) the Scottish Household Survey is a great candidate for testing with the singledesign branch.

Detailed info and data scripts as well as downloads are available on this really old website.

In the Lumley survey textbook, you will also find multiple examples of R designs and code for the older version of SHS in Chapter 6, from figure 6.2 onwards, pp. 110-130.

smishr commented 1 year ago

The old PEAS exemplars include 6 surveys, complete with R code, that are small enough to be analysed locally on a modern computer without much hassle. Tests and designs can be translated from the code and explanations given here.

smishr commented 1 year ago

After taking a deeper look, I think we should export all of the survey RData files that are linked on those websites and add them to the Julia assets/ folder. They aren't very big, only a few KB at most, with about 5-10 thousand observations and with weights, clusters and strata.

iuliadmtru commented 1 year ago

PR #166 adds more datasets to use for testing. We should remove all the datasets within assets/ that we are not using and will not use for testing.

I added the datasets you mentioned, apart from the last two. Those are neither clustered nor stratified. I think we have enough datasets now and we should focus on testing. @ayushpatnaikgit I will start testing right after you push the latest version of bootweights.

smishr commented 1 year ago

Firstly, should we wget and download these datasets, or ship them as part of the package?

Can we check the licenses of those datasets, and whether they are GPLv3 or similar and hence can be distributed with Survey.jl?
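
If the licenses turn out not to allow redistribution, one possible approach is to download each dataset the first time the tests need it and verify a checksum, rather than shipping the file. The following is only a minimal sketch under that assumption: `fetch_dataset` and `file_sha256` are hypothetical helpers, not part of Survey.jl, and the URL and checksum would have to be supplied per dataset. It uses only the `Downloads` and `SHA` standard libraries.

```julia
using Downloads
using SHA

# Hypothetical helper: hex-encoded SHA-256 digest of a file on disk.
file_sha256(path::AbstractString) = open(io -> bytes2hex(sha256(io)), path)

# Hypothetical helper (not part of Survey.jl): fetch `url` into `dest`
# unless a copy with the expected SHA-256 checksum is already cached there.
# `downloader` can be swapped out, e.g. to avoid network access in tests.
function fetch_dataset(url::AbstractString, dest::AbstractString;
                       sha256_hex::AbstractString,
                       downloader = Downloads.download)
    expected = lowercase(sha256_hex)
    if !(isfile(dest) && file_sha256(dest) == expected)
        mkpath(dirname(dest))          # create the cache directory if needed
        downloader(url, dest)
        actual = file_sha256(dest)
        if actual != expected
            rm(dest; force = true)     # do not keep a corrupted or altered file
            error("checksum mismatch for $url: got $actual")
        end
    end
    return dest
end
```

A test could then call `fetch_dataset(url, joinpath(cache_dir, "shs.csv"); sha256_hex = "...")` before building a design, with `url`, `cache_dir` and the checksum filled in for the dataset in question. Shipping the files in assets/ remains simpler if the licenses allow it.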