yrosseel / lavaan

an R package for structural equation modeling and more
http://lavaan.org
429 stars 98 forks source link

Pairwise likelihood cluster robust standard errors #340

Closed haziqj closed 5 months ago

haziqj commented 5 months ago

These changes will allow for robust cluster SE using PML estimation for ordinal variables. Changes made:

  1. Default options for PML & categorical will be "robust.cluster" (not "robust.cluster.sem") [lav_options.R]
  2. Turn off lav_msg_stop when PML & categorical [lav_lavaan_stop02_options.R]
  3. Modify the J matrix (i.e. crossproduct of scores / information H1) to account for clusters [lav_model_h1_information.R]

These modifications will adjust the SE to account for a reduced effective sample size due to clustering. This was highlighted by a reviewer: Consider the case where a sample of 30 is obtained using SRS from population of size 100. Each unit's selection probability is 3/10. On the other hand, consider a population of size 100 clustered into 10 groups of 10, and sample 3 clusters via SRS. The selection probability is yet again 3/10. Without adjusting the J matrix properly the weights alone cannot account for the clustering effect. See Binder (1983), Asparouhov (2005), and Walter (2007).

Note: The cluster option can be specified with or without sampling.weights. The reason to use weights might be unequal probability sampling of clusters.

References

haziqj commented 5 months ago

For posterity, a reprex:

library(lavaan)
#> This is lavaan 0.6-18.2057
#> lavaan is FREE software! Please report any bugs.
set.seed(1234)
LETTERS702 <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))

# Generate a clustered population of 5 binary items
Nclust <- 500
clust_sizes <- sample(10:30, Nclust, replace = TRUE)
N <- sum(clust_sizes)
clust <- LETTERS702[rep(1:Nclust, clust_sizes)]

# Underlying variable approach
pop.model <- '
eta1 =~ 0.8 * y1 + 0.7 * y2 + 0.47 * y3 + 0.38 * y4 + 0.34 * y5
'
tau <- c(1.43, 0.55, 0.13, 0.72, 1.13)
ystar <- simulateData(pop.model, sample.nobs = N, standardized = TRUE)

# Generate binary data based on thresholds
Pop <- 1 * (ystar > matrix(tau, nrow = N, ncol = 5, byrow = TRUE))
colnames(Pop) <- paste0("y", 1:5)
Pop <- lapply(as.data.frame(Pop), \(x) ordered(x, levels = c(0, 1)))
Pop$clust <- clust
Pop <- as.data.frame(Pop)

# Perform a clustered sample (nc = 50)
samp_prob <- clust_sizes / sum(clust_sizes)  # PPS
names(samp_prob) <- LETTERS702[1:Nclust]
clust_samp <- sample(unique(clust), 50, prob = samp_prob)
Data <- Pop[Pop$clust %in% clust_samp, ]
Data$wt <- 1 / samp_prob[Data$clust]

# Fit model
mod <- '
eta1 =~ y1 + y2 + y3 + y4 + y5
'
fit <- sem(model = mod, data = Data, estimator = "PML", std.lv = TRUE,
           sampling.weights = "wt", cluster = "clust")
summary(fit)
#> lavaan 0.6.18.2057 ended normally after 20 iterations
#> 
#>   Estimator                                        PML
#>   Optimization method                           NLMINB
#>   Number of model parameters                        10
#> 
#>   Number of observations                          1020
#>   Number of clusters [clust]                        50
#>   Sampling weights variable                         wt
#> 
#> Model Test User Model:
#>                                               Standard      Scaled
#>   Test Statistic                                14.279      14.050
#>   Degrees of freedom                                 5       3.813
#>   P-value (Unknown)                                 NA       0.006
#>   Scaling correction factor                                  1.016
#>     mean+var adjusted correction (PML)                            
#> 
#> Parameter Estimates:
#> 
#>   Parameterization                               Delta
#>   Standard errors                        Robust.cluster
#>   Information                                  Observed
#>   Observed information based on                 Hessian
#> 
#> Latent Variables:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   eta1 =~                                             
#>     y1                0.695    0.070    9.919    0.000
#>     y2                0.741    0.064   11.510    0.000
#>     y3                0.448    0.058    7.757    0.000
#>     y4                0.450    0.061    7.344    0.000
#>     y5                0.306    0.061    5.044    0.000
#> 
#> Thresholds:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>     y1|t1             1.379    0.058   23.804    0.000
#>     y2|t1             0.488    0.040   12.167    0.000
#>     y3|t1             0.123    0.034    3.601    0.000
#>     y4|t1             0.751    0.039   19.093    0.000
#>     y5|t1             1.133    0.049   22.908    0.000
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .y1                0.517                           
#>    .y2                0.450                           
#>    .y3                0.799                           
#>    .y4                0.798                           
#>    .y5                0.906                           
#>     eta1              1.000

Created on 2024-04-26 with reprex v2.0.2

yrosseel commented 5 months ago

Great! Thanks a lot. Merged now.

haziqj commented 5 months ago

Thanks Yves!