xKDR / Survey.jl

Analysis of complex surveys
https://xkdr.github.io/Survey.jl/
GNU General Public License v3.0

DataFrames-like API #4

Open nalimilan opened 2 years ago

nalimilan commented 2 years ago

It's really great that you aim to replicate features provided by the R survey package, as it's really the reference in that domain. However, its API is quite ad hoc and forces users to learn a completely new syntax when moving from unweighted to weighted/survey analyses. Have you considered adopting a syntax based on DataFrames groupby and combine/select/transform? In R, the recent srvyr package does that by wrapping survey functions with a dplyr-like syntax.

In particular, svyby(:api00, [:cname, :meals], dclus1, svymean) could be written as combine(groupby(dclus1, [:cname, :meals]), :api00 => svymean). With DataFramesMeta, this could become @combine(groupby(dclus1, [:cname, :meals]), svymean(:api00)). One would also be able to compute the mean of all columns using combine(dclus1, All() .=> svymean).
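To make the proposed shape concrete, here is a minimal sketch using plain DataFrames.jl only. It assumes a made-up data frame with a weight column `pw` standing in for a survey design, and a hypothetical helper `wmean` standing in for `svymean`; neither is part of Survey.jl, and no variance estimation is attempted.

```julia
using DataFrames

# Hypothetical stand-in for a survey design: a plain DataFrame with a weight column.
df = DataFrame(cname = ["A", "A", "B", "B"],
               api00 = [600.0, 700.0, 500.0, 800.0],
               pw    = [1.0, 3.0, 2.0, 2.0])

# Hypothetical helper (not Survey.jl's svymean): a weighted mean of x with weights w.
wmean(x, w) = sum(x .* w) / sum(w)

# Same groupby/combine shape as suggested above, passing the value and weight columns.
res = combine(groupby(df, :cname), [:api00, :pw] => wmean => :api00_mean)
```

With a design object, the weight column would be carried by the design itself, which is what would let the call shrink to `:api00 => svymean`.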

In terms of implementation I saw that you already use combine under the hood so it shouldn't be problematic.

Cc: @bkamins

ayushpatnaikgit commented 2 years ago

I think this would be better, and yes, it's precisely what is done under the hood. It will offer greater flexibility to the user. I've studied the srvyr package, and I've also identified many other places where the survey package in R is weak.

However, until 1.0, I want to hold back and just implement the survey package in Julia. The gains in speed are enough to bring users from R to Julia.

I wish to develop the package in this manner because all researchers at xKDR who work on survey data use the survey package in R, and it'll be effortless for them to switch to the Julia version. Many other researchers outside xKDR, whom I know, also use the survey package in R. I think the package is at least 20 years old. It has gone into the knowledge base of many organizations.

I think a feature like:

combine(dclus1, All() .=> svymean)

will be very useful.

Also, what if there is a function that has multiple return values, for example, the fivenum function in R? I couldn't find the relevant syntax in the DataFrames documentation.

Do you know how I can correct the following?

combine(groupby(dclus1, [:cname, :meals]), :api00 => fivenum)

Here, fivenum returns multiple values.

bkamins commented 2 years ago

Is this what you want?

julia> fivenum(x) = [x, 2x, 3x]
fivenum (generic function with 1 method)

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> combine(df, :a => fivenum => AsTable)
3×3 DataFrame
 Row │ x1     x2     x3    
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2      3
   2 │     2      4      6
   3 │     3      6      9

julia> combine(df, :a => fivenum => [:col1, :col2, :col3])
3×3 DataFrame
 Row │ col1   col2   col3  
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2      3
   2 │     2      4      6
   3 │     3      6      9

ayushpatnaikgit commented 2 years ago

Many thanks, professor. I wanted it for a GroupedDataFrame, and it's working with some modification.

df = DataFrame(a = 4:7, b = ["Apple", "Orange", "Orange", "Apple"])

gdf = groupby(df, :b)

fivenum(x) = DataFrame(s = sum(x), s2 = 2*sum(x), s3 = 3*sum(x))

combine(gdf, :a => fivenum => AsTable)

I will implement this. Closing the issue.

bkamins commented 2 years ago

I wanted it for a GroupedDataFrame

The DataFrames.jl API is the same for a data frame and a GroupedDataFrame.
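As a small illustration of that point, the sketch below runs the same `combine` call on a data frame and on a GroupedDataFrame. The function name `threesums` is made up for this example; it returns a NamedTuple, which `AsTable` expands into one column per field.

```julia
using DataFrames

df = DataFrame(a = 4:7, b = ["Apple", "Orange", "Orange", "Apple"])

# Hypothetical summary function returning several named values at once.
threesums(x) = (s = sum(x), s2 = 2 * sum(x), s3 = 3 * sum(x))

# Identical transformation, applied to the whole data frame and per group.
whole = combine(df, :a => threesums => AsTable)
by_b  = combine(groupby(df, :b), :a => threesums => AsTable)
```

Here `whole` has a single row, while `by_b` has one row per value of `b`, with the grouping column kept alongside `s`, `s2` and `s3`.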

smishr commented 2 years ago

I suggest a piping syntax like this, instead of, or in addition to svyby, for getting mean heights by country for a design:

# A macro receives unevaluated expressions, so its argument cannot be typed as AbstractSurveyDesign
macro svypipe(expr)
    # Some definitions
end
@svypipe design |> groupby(:country) |> mean(:height)

nalimilan commented 2 years ago

There's no need for a special macro AFAICT. Chain.jl's @chain and other packages already work.
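For instance, a minimal sketch with Chain.jl on a plain data frame could look like this. The data and the output column name `height_mean` are made up, and the data frame stands in for `design.data`, since this sketch does not assume `groupby` accepts a survey design.

```julia
using Chain, DataFrames, Statistics

# Made-up data standing in for design.data.
data = DataFrame(country = ["DE", "DE", "RO"], height = [170.0, 180.0, 165.0])

# @chain passes each result as the first argument of the next call.
result = @chain data begin
    groupby(:country)
    combine(:height => mean => :height_mean)
end
```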

ayushpatnaikgit commented 2 years ago

The rest of the package doesn't do piping, so it may look odd if it's just in one place. I am open to having pipes everywhere.

@chain looks good, but the mandatory begin ... end doesn't look nice. Maybe there is no way around it. Perhaps Lazy.jl is more suitable.

using DataFrames, Statistics  # for combine and mean
using Lazy
import DataFrames.groupby  # pick DataFrames' groupby explicitly
@> design.data groupby(:country) combine(:height => mean)

This is similar to Stata's API, and as someone coming from C, I really detest this.

iuliadmtru commented 2 years ago

It would indeed be better to use already existing functionality. So far the solution that I like most for this is Pipe.jl (I understand that pipes are hard to write on a German keyboard, but I think the block syntax is not really appropriate for this type of operation, and piping is a lot neater-looking IMO). But for now I would say wait until we implement this feature (if we implement it) because, who knows, maybe soon we'll be able to use underscores as r-values. If this gets implemented in Julia, we might be able to do something like

(design, :country) |> groupby(_, _) |> mean(_, :height)

if we add support for AbstractSurveyDesign in groupby (and if we change svymean to mean, but that's a minor aspect in the context of this discussion). If this would be possible, it would be great! It looks nice, it is clear and concise, and it is using Base. We could also do the same thing for functions other than groupby.
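Until underscore placeholders exist, something close can already be written with Base's `|>` and anonymous functions. The sketch below uses a made-up plain data frame rather than a survey design, and the output column name `height_mean` is an assumption for illustration only.

```julia
using DataFrames, Statistics

# Made-up data standing in for the design's data.
data = DataFrame(country = ["DE", "DE", "RO"], height = [170.0, 180.0, 165.0])

# Each stage is an explicit anonymous function, so only Base's |> is needed.
result = data |>
    (d -> groupby(d, :country)) |>
    (g -> combine(g, :height => mean => :height_mean))
```

It is wordier than the underscore version, but it needs no extra package and no macro.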

ayushpatnaikgit commented 2 years ago

For a new user,

(design, :country) |> groupby(_, _) |> mean(_, :height)

this will be difficult to understand. Might as well do what Milan is suggesting, i.e.

@combine(groupby(design, :country), mean(:height))