Open SiPineau opened 9 months ago
This question is probably more suited for the lavaan discussion group. But one can also see this as a request for better documentation. One day, I will write a decent manual.
First of all, lavaan assumes you have a CFA model. Any structural components of the model (i.e., regressions among latent variables) are ignored. This is by design.
The default is type = "lv", which means that we try to assign values to the latent variables (i.e., factor scores), given the data and the estimated model parameters. When the data are continuous, the 'classic' approaches are 'regression' factor scores and Bartlett factor scores (i.e., method = "regression" and method = "Bartlett"). A good reference is:
Grice, J. W. (2001). A comparison of factor scores under conditions of factor obliquity. Psychological Methods, 6, 67-83.
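To make the difference between the two classic methods concrete, here is a minimal sketch for a one-factor model with a diagonal Theta. It is written in plain Python rather than R and is not lavaan's implementation; the function name, the parameter values, and the intercept vector `nu` are assumptions made for illustration only. With one factor, the regression score reduces to the Bartlett score multiplied by a shrinkage factor smaller than one.

```python
def factor_scores_one_factor(y, nu, lam, theta, psi):
    """Return (regression, Bartlett) factor scores for one observation.

    Hypothetical helper for a one-factor model:
      y     observed indicator values
      nu    indicator intercepts
      lam   factor loadings (Lambda)
      theta residual variances (diagonal of Theta, all > 0)
      psi   factor variance (Psi)
    """
    # lambda' Theta^{-1} (y - nu)
    num = sum(l * (yi - n) / t for yi, n, l, t in zip(y, nu, lam, theta))
    # lambda' Theta^{-1} lambda
    q = sum(l * l / t for l, t in zip(lam, theta))
    # Bartlett: (lambda' Theta^{-1} lambda)^{-1} lambda' Theta^{-1} (y - nu)
    bartlett = num / q
    # Regression: psi lambda' Sigma^{-1} (y - nu), using the closed-form
    # inverse of Sigma = psi * lambda lambda' + Theta for a single factor
    regression = psi * num / (1.0 + psi * q)
    return regression, bartlett


# Made-up parameter values, for illustration only
y = [1.2, 0.8, 1.5]
nu = [0.0, 0.0, 0.0]
lam = [1.0, 0.8, 0.6]
theta = [0.5, 0.6, 0.7]
psi = 1.0

regression, bartlett = factor_scores_one_factor(y, nu, lam, theta, psi)
print("regression:", regression)
print("Bartlett:  ", bartlett)
```

The regression score is always pulled toward the factor mean (zero here) relative to the Bartlett score, which is the usual trade-off between the two: less variance versus conditional unbiasedness.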
From a computational point of view, lavaan can handle zero elements in Theta. We do this by using Sigma instead of Theta in the Bartlett computations. See:
Bentler & Yuan (1997) 'Optimal Conditionally Unbiased Equivariant Factor Score Estimators' in Berkane (Ed) 'Latent variable modeling with applications to causality' (Springer-Verlag)
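In standard notation, the substitution can be written out as follows. This is a sketch, not copied from the lavaan source; the intercept vector $\nu$ and the model-implied covariance matrix $\Sigma = \Lambda \Psi \Lambda^\top + \Theta$ are notation assumed here.

$$
\hat{\eta}_{\text{Bartlett}} = (\Lambda^\top \Theta^{-1} \Lambda)^{-1} \Lambda^\top \Theta^{-1} (y - \nu)
$$

$$
\hat{\eta}_{\text{Bartlett}} = (\Lambda^\top \Sigma^{-1} \Lambda)^{-1} \Lambda^\top \Sigma^{-1} (y - \nu)
$$

When $\Theta$ is invertible, the two expressions give the same scores, because $\Lambda^\top \Sigma^{-1} = (I + \Lambda^\top \Theta^{-1} \Lambda \Psi)^{-1} \Lambda^\top \Theta^{-1}$ and the extra factor cancels. The second form only requires $\Sigma$ to be invertible, so it remains usable when $\Theta$ has zeros on its diagonal.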
lavaan can also handle missing values (in the indicators); the factor scores will be complete. Not sure if there is a good reference for this. Finally, in the continuous case, EB=EBM=regression, while ML=Bartlett. (EB = Empirical Bayes, EBM = Empirical Bayes Modal).
When data is categorical, lavaan uses the approach described in the Mplus technical appendices (http://www.statmodel.com/download/techappen.pdf) page 48. Here, we need an iterative procedure per observation, which is much more computationally intensive. In fact, eq 231 only describes the EBM (or regression) approach. If you remove the first term (before the minus sign), you get ML. But the latter often gives numerical issues, so lavaan uses a vague prior to avoid Inf values when using method = "ML".
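The per-observation iteration can be sketched as follows for a single factor and binary indicators. This is only an illustration in plain Python, not lavaan's code: it assumes a probit link with unit residual variance for the latent responses, uses a simple golden-section search on a bounded interval, and all function names, parameter values, and bounds are hypothetical. The objective has the same shape as described above: a prior term minus the log-likelihood of the observed response pattern. Keeping the prior term gives EBM; dropping it gives ML.

```python
import math


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def objective(eta, y, lam, tau, psi, use_prior):
    """Prior term minus log-likelihood of one binary response pattern."""
    prior_term = 0.5 * eta * eta / psi if use_prior else 0.0
    loglik = 0.0
    for y_j, lam_j, tau_j in zip(y, lam, tau):
        p_one = norm_cdf(lam_j * eta - tau_j)  # P(y_j = 1 | eta)
        p = p_one if y_j == 1 else 1.0 - p_one
        loglik += math.log(max(p, 1e-300))  # guard against log(0)
    return prior_term - loglik


def minimize_1d(f, lo=-10.0, hi=10.0, tol=1e-8):
    """Golden-section search; adequate here because the objective is convex."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if f(c) < f(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


def score(y, lam, tau, psi=1.0, method="EBM"):
    """Factor score for one observation; method is 'EBM' or 'ML'."""
    use_prior = method == "EBM"
    return minimize_1d(lambda eta: objective(eta, y, lam, tau, psi, use_prior))


# Made-up loadings and thresholds, for illustration only
lam = [0.8, 0.7, 0.6]
tau = [0.0, 0.2, -0.3]

ebm_mixed = score([1, 0, 1], lam, tau, method="EBM")
ml_mixed = score([1, 0, 1], lam, tau, method="ML")
ebm_ones = score([1, 1, 1], lam, tau, method="EBM")
ml_ones = score([1, 1, 1], lam, tau, method="ML")  # runs to the search bound

print("mixed pattern:    EBM", ebm_mixed, " ML", ml_mixed)
print("all-ones pattern: EBM", ebm_ones, " ML", ml_ones)
```

For the all-ones pattern the likelihood keeps increasing in eta, so the ML search ends at the upper bound of the interval, while the EBM estimate stays finite. This is the kind of numerical issue that a prior term guards against.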
A decent (but technical) description of all these methods can be found in Chapter 7 of the 2004 book:
"Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models" by Skrondal & Rabe-Hesketh (https://www.routledge.com/Generalized-Latent-Variable-Modeling-Multilevel-Longitudinal-and-Structural/Skrondal-Rabe-Hesketh/p/book/9781584880004).
If type = "ov" or "yhat", we first compute factor scores, and then use the formula $\Lambda \eta$ to compute predicted values for the observed indicators (y).
If type = "resid", we first compute the 'yhat' values, and then compute the difference with the actual observed (y) values. These residuals can be used for diagnostics, such as detecting outlying observations. (See also the new mdist = argument).
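The relation between factor scores, predicted indicator values, and residuals can be sketched in a few lines. Again this is plain Python for a one-factor model, not lavaan's code; the function names and numbers are hypothetical, and the intercept vector `nu` is an assumption added so the sketch also works with uncentered data (set it to zero to get $\Lambda \eta$ exactly).

```python
def predict_indicators(eta_hat, lam, nu):
    """Predicted indicator values: nu + Lambda * eta_hat (one factor)."""
    return [n + l * eta_hat for l, n in zip(lam, nu)]


def indicator_residuals(y, yhat):
    """Observed minus predicted values for one observation."""
    return [yi - yh for yi, yh in zip(y, yhat)]


# Made-up values, for illustration only
lam = [1.0, 0.8, 0.6]
nu = [0.0, 0.0, 0.0]
y = [1.2, 0.8, 1.5]
eta_hat = 1.0  # a factor score obtained beforehand by any method

yhat = predict_indicators(eta_hat, lam, nu)
resid = indicator_residuals(y, yhat)
print("yhat: ", yhat)
print("resid:", resid)
```

An observation whose residuals are large relative to the residual variances of the indicators is a candidate outlier.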
In any case, the man page of lavPredict() needs more info. I will keep this issue open until that is done.
Hi,
I would like to better understand how the factor scores are estimated in the lavPredict function depending on the type and method arguments.
I didn't find any references on rdrr.io or on CRAN. Could you point me to some references, or show me where I can find out how lavPredict computes factor scores?
Thank you, have a nice day.
Simon Pineau