yrosseel / lavaan

an R package for structural equation modeling and more
http://lavaan.org
435 stars 99 forks source link

Multilevel SEM with `missing = "fiml"` failed when cluster ID is not ordered #379

Closed marklhc closed 2 months ago

marklhc commented 3 months ago

I've been having some convergence issues when running multilevel SEM with FIML. One observation I have is that the estimation is more likely to fail when the cluster ID is not sorted. Below is a reproducible example. As you can see, I randomly created missing values on 10 observations on one variable with the built-in Demo.twolevel data set, and both listwise deletion and FIML work. However, when I change the order of the observations, only listwise deletion works, but FIML fails. Does the multilevel SEM estimation currently requires the cluster ID to be sorted? Thank you very much.

library(lavaan)
#> This is lavaan 0.6-19.2180
#> lavaan is FREE software! Please report any bugs.
data(Demo.twolevel)
# Randomly make 10 observations missing in y1
dat1 <- Demo.twolevel
dat1$y1[sample(nrow(dat1), 10)] <- NA
summary(dat1)
#>        y1                 y2                 y3                 y4          
#>  Min.   :-7.95042   Min.   :-5.16954   Min.   :-6.00250   Min.   :-5.42627  
#>  1st Qu.:-1.17093   1st Qu.:-1.02755   1st Qu.:-0.94635   1st Qu.:-0.92783  
#>  Median : 0.03189   Median :-0.03823   Median :-0.01651   Median : 0.10881  
#>  Mean   : 0.02617   Mean   :-0.02417   Mean   :-0.02362   Mean   : 0.06407  
#>  3rd Qu.: 1.21427   3rd Qu.: 0.97475   3rd Qu.: 0.87433   3rd Qu.: 1.12141  
#>  Max.   : 5.99062   Max.   : 6.15177   Max.   : 5.91458   Max.   : 5.38516  
#>  NA's   :10                                                                 
#>        y5                 y6                 x1                  x2           
#>  Min.   :-5.49878   Min.   :-6.61292   Min.   :-3.588218   Min.   :-3.959398  
#>  1st Qu.:-0.83120   1st Qu.:-0.84995   1st Qu.:-0.651270   1st Qu.:-0.691653  
#>  Median : 0.10035   Median : 0.01773   Median : 0.000886   Median : 0.026444  
#>  Mean   : 0.07786   Mean   : 0.01233   Mean   :-0.007445   Mean   :-0.002891  
#>  3rd Qu.: 1.00385   3rd Qu.: 0.91615   3rd Qu.: 0.635861   3rd Qu.: 0.647923  
#>  Max.   : 5.13682   Max.   : 5.48243   Max.   : 3.133739   Max.   : 3.675690  
#>                                                                               
#>        x3                 w1                  w2              cluster     
#>  Min.   :-3.58632   Min.   :-2.409879   Min.   :-3.01746   Min.   :  1.0  
#>  1st Qu.:-0.68828   1st Qu.:-0.614158   1st Qu.:-0.84969   1st Qu.: 51.0  
#>  Median : 0.02448   Median : 0.004384   Median :-0.08167   Median :100.5  
#>  Mean   : 0.02001   Mean   : 0.019837   Mean   :-0.09127   Mean   :101.0  
#>  3rd Qu.: 0.69373   3rd Qu.: 0.774702   3rd Qu.: 0.56239   3rd Qu.:151.0  
#>  Max.   : 3.86090   Max.   : 2.474085   Max.   : 2.19585   Max.   :200.0  
#> 
# Model from https://lavaan.ugent.be/tutorial/multilevel.html
model <- '
    level: 1
        fw =~ y1 + y2 + y3
        fw ~ x1 + x2 + x3
    level: 2
        fb =~ y1 + y2 + y3
        fb ~ w1 + w2
'
fit <- sem(model = model, data = dat1, cluster = "cluster")
fit_fiml <- sem(model = model, data = dat1, cluster = "cluster",
                missing = "fiml")

# Now, consider the same data, but not sorted by cluster ID
set.seed(2111)
random_order <- sample(nrow(dat1))
dat2 <- dat1[random_order, ]
summary(dat2)
#>        y1                 y2                 y3                 y4          
#>  Min.   :-7.95042   Min.   :-5.16954   Min.   :-6.00250   Min.   :-5.42627  
#>  1st Qu.:-1.17093   1st Qu.:-1.02755   1st Qu.:-0.94635   1st Qu.:-0.92783  
#>  Median : 0.03189   Median :-0.03823   Median :-0.01651   Median : 0.10881  
#>  Mean   : 0.02617   Mean   :-0.02417   Mean   :-0.02362   Mean   : 0.06407  
#>  3rd Qu.: 1.21427   3rd Qu.: 0.97475   3rd Qu.: 0.87433   3rd Qu.: 1.12141  
#>  Max.   : 5.99062   Max.   : 6.15177   Max.   : 5.91458   Max.   : 5.38516  
#>  NA's   :10                                                                 
#>        y5                 y6                 x1                  x2           
#>  Min.   :-5.49878   Min.   :-6.61292   Min.   :-3.588218   Min.   :-3.959398  
#>  1st Qu.:-0.83120   1st Qu.:-0.84995   1st Qu.:-0.651270   1st Qu.:-0.691653  
#>  Median : 0.10035   Median : 0.01773   Median : 0.000886   Median : 0.026444  
#>  Mean   : 0.07786   Mean   : 0.01233   Mean   :-0.007445   Mean   :-0.002891  
#>  3rd Qu.: 1.00385   3rd Qu.: 0.91615   3rd Qu.: 0.635861   3rd Qu.: 0.647923  
#>  Max.   : 5.13682   Max.   : 5.48243   Max.   : 3.133739   Max.   : 3.675690  
#>                                                                               
#>        x3                 w1                  w2              cluster     
#>  Min.   :-3.58632   Min.   :-2.409879   Min.   :-3.01746   Min.   :  1.0  
#>  1st Qu.:-0.68828   1st Qu.:-0.614158   1st Qu.:-0.84969   1st Qu.: 51.0  
#>  Median : 0.02448   Median : 0.004384   Median :-0.08167   Median :100.5  
#>  Mean   : 0.02001   Mean   : 0.019837   Mean   :-0.09127   Mean   :101.0  
#>  3rd Qu.: 0.69373   3rd Qu.: 0.774702   3rd Qu.: 0.56239   3rd Qu.:151.0  
#>  Max.   : 3.86090   Max.   : 2.474085   Max.   : 2.19585   Max.   :200.0  
#> 
fit2 <- sem(model = model, data = dat2, cluster = "cluster")
fit2_fiml <- sem(model = model, data = dat2, cluster = "cluster",
                 missing = "fiml")
#> Warning: lavaan->lav_lavaan_step11_estoptim():  
#>    Model estimation FAILED! Returning starting values.

#> Warning: lavaan->lav_lavaan_step11_estoptim():  
#>    Model estimation FAILED! Returning starting values.
#> Warning: lavaan->lav_lavaan_step15_baseline():  
#>    estimation of the baseline model failed.

Created on 2024-08-15 with reprex v2.1.0

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.4.1 (2024-06-14) #> os Ubuntu 22.04.4 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/Los_Angeles #> date 2024-08-15 #> pandoc 3.1.9 @ /usr/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1) #> digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1) #> evaluate 0.23 2023-11-01 [3] CRAN (R 4.3.2) #> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1) #> fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.1) #> glue 1.7.0 2024-01-09 [3] CRAN (R 4.3.2) #> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1) #> knitr 1.45 2023-10-30 [3] CRAN (R 4.3.2) #> lavaan * 0.6-19.2180 2024-08-16 [1] Github (yrosseel/lavaan@56b8a11) #> lifecycle 1.0.4 2023-11-07 [3] CRAN (R 4.3.2) #> magrittr 2.0.3 2022-03-30 [3] CRAN (R 4.2.0) #> mnormt 2.1.1 2022-09-26 [3] CRAN (R 4.2.1) #> pbivnorm 0.6.0 2015-01-23 [3] CRAN (R 4.0.2) #> purrr 1.0.2 2023-08-10 [3] CRAN (R 4.3.1) #> quadprog 1.5-8 2019-11-20 [3] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [3] CRAN (R 4.3.2) #> R.methodsS3 1.8.2 2022-06-13 [3] CRAN (R 4.2.1) #> R.oo 1.26.0 2024-01-24 [3] CRAN (R 4.3.2) #> R.utils 2.12.3 2023-11-18 [3] CRAN (R 4.3.2) #> reprex 2.1.0 2024-01-11 [3] CRAN (R 4.3.2) #> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.1) #> rmarkdown 2.25 2023-09-18 [3] CRAN (R 4.3.1) #> sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.2.0) #> styler 1.10.2 2023-08-29 [3] CRAN (R 4.3.1) #> vctrs 0.6.5 2023-12-01 [3] CRAN (R 4.3.2) #> withr 3.0.0 2024-01-16 [3] CRAN (R 4.3.2) #> xfun 0.41 2023-11-01 [3] CRAN (R 4.3.2) #> yaml 2.3.8 2023-12-11 [3] CRAN (R 4.3.2) #> #> [1] /home/markl/R/x86_64-pc-linux-gnu-library/4.4 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
yrosseel commented 2 months ago

Indeed. That is not ok. Something must be wrong with the internal indices. I will investigate.

yrosseel commented 2 months ago

Ok. After many hours, I think I have finally found the problem. In lav_data_patterns.R (around line 80), we obtain the unique cluster numbers per missing pattern, and a few lines later, we use table() to count the number of cases for this cluster number. But table() seems to automatically sort() the cluster numbers first, leading to a mismatch in some cases...

Fixed on github.

Can you confirm this works for you?

marklhc commented 2 months ago

Thank you @yrosseel. I just checked my example and a few others and it works! Ya I remember table() sometimes have unexpected behaviors.

Closing this now.