yrosseel / lavaan

an R package for structural equation modeling and more
http://lavaan.org
432 stars 100 forks source link

simulateData with standardized=TRUE #46

Closed jtextor closed 1 day ago

jtextor commented 8 years ago

The function "simulateData" is currently not able to provide standardized output in the general case, as the following example demonstrates:

lav <- lavaanify( "
    y ~ (-.5) * x + (.25) * z
    x ~ .5*U
    z ~ .5*U
")
diag(cov(simulateData( lav, standardized=TRUE, sample.nobs=100000 )))

The variance of y far exceeds 1.

ylenio commented 8 years ago

Also, the skewness and kurtosis arguments in simulateData produce empirical skewness and kurtosis that do not always match the simulated values. For example, often empirical skewness and kurtosis values are much lower than the expected (i.e. simulated) values.

philchalmers commented 7 years ago

@ylenio that property may be related to this pull request: https://github.com/yrosseel/lavaan/pull/79. But in general this data generation method isn't perfect.

yrosseel commented 2 years ago

5 years later, this is still not fixed. In that period, two people told me they would take this up, but nothing happened so far. It would seem I have to fix this myself.

TDJorgensen commented 2 years ago

I thought the simstandard package addressed this

https://cran.r-project.org/web/packages/simstandard/vignettes/simstandard_tutorial.html

yrosseel commented 2 years ago

Indeed. So maybe lavaan should just remove the standardized= argument in simulateData() (for now).

yrosseel commented 1 day ago

I needed to make changes to the (very old) setResidualElements.LISREL() function in order to make the correlation = TRUE option more robust.

It turns out the same 'trick' was needed to fix the standardized = TRUE option in simulateData(). The problem is to find those values for the residual variances in the PSI matrix, so that IB.inv %*% PSI %*% t(IB.inv) is a correlation matrix (at least for the y variables). It would seem there are two settings:

1) The Beta matrix is acyclic (ie. no cycles); then, for each y variable, we need to find its ancestors, and then we can compute the total variance of the RHS of the regression (without the error term), and then we can find the residual exactly; as almost all models in SEM are acyclic, it is nice to have an exact method

2) The beta matrix is not acyclic; I don't think an analytic solution is possible (but I may be wrong); I then cast this as a (nonlinear) optimization problem, and let nlminb() do the hard work.

Ironically, this is fixed just as I was about the replace the simulateData() function, and replace it by another (and more powerful) version. Oh well.

To try it out with the example that failed:

model <- '
  y ~ (-.5) * x + (.25) * z
  x ~ .5*U
  z ~ .5*U
'
Data <- simulateData(model, empirical = TRUE, return.fit = TRUE,
                     standardized = TRUE)
fit <- attr(Data, "fit")
lavInspect(fit, "implied")$cov

results in

       y      x      z      U
y  1.000                     
x -0.438  1.000              
z  0.125  0.250  1.000       
U -0.125  0.500  0.500  1.000

In the simstandard tutorial, the following example is given:

model <- '
  Y ~ -.75 * X_1 + .25 * X_2
  X =~ .75 * X_1 + .75 * X_2
'
Data <- simulateData(model, empirical = TRUE, return.fit = TRUE,
                     standardized = TRUE)
fit <- attr(Data, "fit")
lavInspect(fit, "implied")$cov

which results in

       X_1    X_2      Y
X_1  1.000              
X_2  0.562  1.000       
Y   -0.609 -0.172  1.000

If you can find any cases that seem to fail, let me know. Finally closing this very old issue.