Pairwise likelihood cluster robust standard errors

haziqj commented 5 months ago

These changes will allow for robust cluster SE using PML estimation for ordinal variables. Changes made:

Default options for PML & categorical will be "robust.cluster" (not "robust.cluster.sem") [lav_options.R]
Turn off lav_msg_stop when PML & categorical [lav_lavaan_stop02_options.R]
Modify the J matrix (i.e. crossproduct of scores / information H1) to account for clusters [lav_model_h1_information.R]

These modifications will adjust the SE to account for a reduced effective sample size due to clustering. This was highlighted by a reviewer: Consider the case where a sample of 30 is obtained using SRS from population of size 100. Each unit's selection probability is 3/10. On the other hand, consider a population of size 100 clustered into 10 groups of 10, and sample 3 clusters via SRS. The selection probability is yet again 3/10. Without adjusting the J matrix properly the weights alone cannot account for the clustering effect. See Binder (1983), Asparouhov (2005), and Walter (2007).

Note: The cluster option can be specified with or without sampling.weights. The reason to use weights might be unequal probability sampling of clusters.

References

Binder, David A. ‘On the Variances of Asymptotically Normal Estimators from Complex Surveys’. International Statistical Review / Revue Internationale de Statistique 51, no. 3 (1983): 279–92. https://doi.org/10.2307/1402588.
Asparouhov, Tihomir. ‘Sampling Weights in Latent Variable Modeling’. Structural Equation Modeling: A Multidisciplinary Journal 12, no. 3 (July 2005): 411–34. https://doi.org/10.1207/s15328007sem1203_4.
Wolter, Kirk M. Introduction to Variance Estimation, 2nd Ed. Introduction to Variance Estimation, 2nd Ed. New York, NY, US: Springer Science + Business Media, 2007.

haziqj commented 5 months ago

For posterity, a reprex:

library(lavaan)
#> This is lavaan 0.6-18.2057
#> lavaan is FREE software! Please report any bugs.
set.seed(1234)
LETTERS702 <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))

# Generate a clustered population of 5 binary items
Nclust <- 500
clust_sizes <- sample(10:30, Nclust, replace = TRUE)
N <- sum(clust_sizes)
clust <- LETTERS702[rep(1:Nclust, clust_sizes)]

# Underlying variable approach
pop.model <- '
eta1 =~ 0.8 * y1 + 0.7 * y2 + 0.47 * y3 + 0.38 * y4 + 0.34 * y5
'
tau <- c(1.43, 0.55, 0.13, 0.72, 1.13)
ystar <- simulateData(pop.model, sample.nobs = N, standardized = TRUE)

# Generate binary data based on thresholds
Pop <- 1 * (ystar > matrix(tau, nrow = N, ncol = 5, byrow = TRUE))
colnames(Pop) <- paste0("y", 1:5)
Pop <- lapply(as.data.frame(Pop), \(x) ordered(x, levels = c(0, 1)))
Pop$clust <- clust
Pop <- as.data.frame(Pop)

# Perform a clustered sample (nc = 50)
samp_prob <- clust_sizes / sum(clust_sizes)  # PPS
names(samp_prob) <- LETTERS702[1:Nclust]
clust_samp <- sample(unique(clust), 50, prob = samp_prob)
Data <- Pop[Pop$clust %in% clust_samp, ]
Data$wt <- 1 / samp_prob[Data$clust]

# Fit model
mod <- '
eta1 =~ y1 + y2 + y3 + y4 + y5
'
fit <- sem(model = mod, data = Data, estimator = "PML", std.lv = TRUE,
           sampling.weights = "wt", cluster = "clust")
summary(fit)
#> lavaan 0.6.18.2057 ended normally after 20 iterations
#> 
#>   Estimator                                        PML
#>   Optimization method                           NLMINB
#>   Number of model parameters                        10
#> 
#>   Number of observations                          1020
#>   Number of clusters [clust]                        50
#>   Sampling weights variable                         wt
#> 
#> Model Test User Model:
#>                                               Standard      Scaled
#>   Test Statistic                                14.279      14.050
#>   Degrees of freedom                                 5       3.813
#>   P-value (Unknown)                                 NA       0.006
#>   Scaling correction factor                                  1.016
#>     mean+var adjusted correction (PML)                            
#> 
#> Parameter Estimates:
#> 
#>   Parameterization                               Delta
#>   Standard errors                        Robust.cluster
#>   Information                                  Observed
#>   Observed information based on                 Hessian
#> 
#> Latent Variables:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   eta1 =~                                             
#>     y1                0.695    0.070    9.919    0.000
#>     y2                0.741    0.064   11.510    0.000
#>     y3                0.448    0.058    7.757    0.000
#>     y4                0.450    0.061    7.344    0.000
#>     y5                0.306    0.061    5.044    0.000
#> 
#> Thresholds:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>     y1|t1             1.379    0.058   23.804    0.000
#>     y2|t1             0.488    0.040   12.167    0.000
#>     y3|t1             0.123    0.034    3.601    0.000
#>     y4|t1             0.751    0.039   19.093    0.000
#>     y5|t1             1.133    0.049   22.908    0.000
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .y1                0.517                           
#>    .y2                0.450                           
#>    .y3                0.799                           
#>    .y4                0.798                           
#>    .y5                0.906                           
#>     eta1              1.000

^{Created on 2024-04-26 with reprex v2.0.2}

yrosseel commented 5 months ago

Great! Thanks a lot. Merged now.

haziqj commented 5 months ago

Thanks Yves!

yrosseel / lavaan

Pairwise likelihood cluster robust standard errors #340