cicadawing commented 4 months ago

I tried to run lavPredictY() on a multigroup model and got an error about "missing variable names".

I have different regressions for my different groups.

The variable names were not missing within my data - the problem persisted even if I did not supply a dataframe to lavPredictY().

I ran a similar example with the Holzinger data, where x5 only appears in group 1, and ran into a similar issue. Is this a bug - or perhaps, this approach is intractable, and thus not supported?

HS.model <- "
  group: Grant-White
    x4 ~ x5
    visual =~ x1 + x2 + x3 + x7
  group: Pasteur
    x1 ~ x3
    visual =~ x1 + x2 + x3 + x4
"

fit <- cfa( HS.model, data = HolzingerSwineford1939, group = "school" )

lavaan 0.6.17 ended normally after 56 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        28

  Number of observations per group:
    Pasteur                                        156
    Grant-White                                    145

Model Test User Model:

  Test statistic                                51.261
  Degrees of freedom                                11
  P-value (Chi-square)                           0.000
  Test statistic for each group:
    Pasteur                                     50.404
    Grant-White                                  0.857

lavPredictY( fit, ynames = lavNames(fit, "ov.y"), xnames = lavNames(fit, "ov.x") )

Error in lavPredictY(fit, ynames = lavNames(fit, "ov.y"), xnames = lavNames(fit, : lavaan ERROR: some variable names in xnames do not appear in the dataset: x5

Second attempt (with dataframe)

lavPredictY( fit, HolzingerSwineford1939, ynames = lavNames(fit, "ov.y"), xnames = lavNames(fit, "ov.x") )

Error in lavPredictY(fit, HolzingerSwineford1939, ynames = lavNames(fit, : lavaan ERROR: some variable names in xnames do not appear in the dataset: x5

yrosseel commented 3 months ago

As long as there are no equality (or other) constraints across the groups, I would recommend fitting the model (and doing the prediction) for each group separately.
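A minimal sketch of that per-group approach, using the two group-specific models from the reprex above. The object names (HS.gw, fit.gw, and so on) are made up for illustration, and the ynames=/xnames= values are an assumption: they are written out by hand from each group's regression rather than taken from lavNames().

```r
library(lavaan)

# Split the data by school, one data frame per group.
HS <- HolzingerSwineford1939
HS.gw <- HS[HS$school == "Grant-White", ]
HS.pa <- HS[HS$school == "Pasteur", ]

# Each group gets its own single-group model, copied from the reprex.
model.gw <- "
  x4 ~ x5
  visual =~ x1 + x2 + x3 + x7
"
model.pa <- "
  x1 ~ x3
  visual =~ x1 + x2 + x3 + x4
"

fit.gw <- cfa(model.gw, data = HS.gw)
fit.pa <- cfa(model.pa, data = HS.pa)

# Assumption: outcome and predictor names are spelled out per group,
# following the regression in each model.
pred.gw <- lavPredictY(fit.gw, ynames = "x4", xnames = "x5")
pred.pa <- lavPredictY(fit.pa, ynames = "x1", xnames = "x3")
```

Because the two fits share no parameters, each call only ever sees the variables that exist in its own group, so the check on xnames= is applied per model rather than across groups.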

I am not sure if we should 'fix' this. The xnames= argument simply expects (at the moment) that the predictor variables are present in all groups. What should we do if this is not the case? Pick the ones that we can find? I find this a bit strange. What is the use case for this? What do you think is the 'right' behavior in this case?

TDJorgensen commented 3 months ago

By explicitly passing the original data to newdata=, the reprex essentially just mimics the default behavior. I agree the default behavior should not change, which is to generate predicted values for the entire set of original data.

For a specialized model that has different variables in different groups, it should be up to the user to provide newdata= for each group, so that the xnames= and ynames= can be specified per group. However, even that is not possible in the current implementation:

HS1 <- HolzingerSwineford1939[HolzingerSwineford1939$school == "Pasteur", ]
HS2 <- HolzingerSwineford1939[HolzingerSwineford1939$school == "Grant-White", ]

## Both of these yield:
## Error: lavaan->lav_data_full():  
##   model syntax defines multiple groups; data suggests a single group

lavPredictY(fit, newdata = HS1, 
            ynames = lavNames(fit, "ov.y", group = 1), 
            xnames = lavNames(fit, "ov.x", group = 1))

lavPredictY(fit, newdata = HS2, 
            ynames = lavNames(fit, "ov.y", group = 2), 
            xnames = lavNames(fit, "ov.x", group = 2))

I think this is due to the use of lavData() to check that newdata= has properties that match the original data. I don't know whether it would be a simple task to update how lavData() works (e.g., to selectively return a @Data slot with one group or a subset of groups, which could be checked for in the newdata[group] vector), or whether there is a different way to validate newdata= only for the group(s) in newdata= (e.g., just checking lavNames(object, group=) for each group for which predictions are requested).
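To make the second option concrete, here is a rough base-R sketch of a per-group name check. The helper check_newdata_groups() is hypothetical and not part of lavaan. The names_by_group list is typed by hand here; in practice it would presumably be built from lavNames(object, group = g) for each group.

```r
# Hypothetical helper, NOT part of lavaan: validate newdata only for the
# groups that actually occur in it.
check_newdata_groups <- function(newdata, group, names_by_group) {
  groups_present <- unique(as.character(newdata[[group]]))

  # Every group in newdata must be a group the model knows about.
  unknown <- setdiff(groups_present, names(names_by_group))
  if (length(unknown) > 0L) {
    stop("unknown group(s) in newdata: ", paste(unknown, collapse = ", "))
  }

  # For each group present, only that group's variables are required.
  for (g in groups_present) {
    missing_vars <- setdiff(names_by_group[[g]], names(newdata))
    if (length(missing_vars) > 0L) {
      stop("group ", g, " is missing variable(s): ",
           paste(missing_vars, collapse = ", "))
    }
  }
  invisible(groups_present)
}

# Assumption: variable names per group, written out from the reprex model.
names_by_group <- list(
  "Pasteur"     = c("x1", "x2", "x3", "x4"),
  "Grant-White" = c("x1", "x2", "x3", "x4", "x5", "x7")
)

# Toy Pasteur-only data without x5: accepted, since x5 is only
# required for Grant-White.
toy <- data.frame(school = "Pasteur", x1 = 1, x2 = 2, x3 = 3, x4 = 4)
check_newdata_groups(toy, "school", names_by_group)
```

A check like this would fail only when a variable is missing from a group that is both in the model and in newdata=, rather than whenever newdata= contains fewer groups than the fitted model.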

yrosseel / lavaan

lavPredictY issue with multigroup models (different structures) #369
