yrosseel / lavaan

an R package for structural equation modeling and more
http://lavaan.org

error with haven_labelled variable type #163

Closed jsraadt closed 4 years ago

jsraadt commented 4 years ago

I imported a .sav (spss) dataset using the haven package, fit a model using sem(), and used lavPredict() on the fitted model. I got an error that says: lavaan ERROR: unknown type: haven_labelled for variable: .... Just bringing this to someone's attention, because the issue goes away after saving the .sav as .csv.

TDJorgensen commented 4 years ago

haven is part of the "tidyverse", and read_sav() imports files as a tbl ("tibble") object, like everything else in the tidyverse. To the utter dismay of R programmers outside Hadley Wickham's influential little clique, tbl objects do not tell R that they "inherit" from the data.frame object, which would save hundreds of users like yourself the trouble of figuring out error messages like this. Two solutions: You could coerce it back to a data.frame after importing:

library(haven)
myData <- as.data.frame(read_sav("myFile.sav"))

Or you could simply use what R already provides with its base distribution:

library(foreign)
myData <- read.spss("myFile.sav", to.data.frame = TRUE,
                    # if you plan to treat Likert items as ordinal or numeric:
                    use.value.labels = FALSE)

yrosseel commented 4 years ago

lavaan 0.6-6 will now always convert tibble data.frames to regular data.frames (and hope for the best).

lionel- commented 4 years ago

To the utter dismay of R programmers outside Hadley Wickham's influential little clique

What the hell.

larmarange commented 4 years ago

It seems that the reported error is not caused by using a tibble, but rather by using a haven_labelled vector that should have been converted into a factor before analysis with lavaan.

lionel- commented 4 years ago

Agreed, this is not about tibbles. We might be able to do better though; I posted an issue in haven. The haven-labelled class is conceptually a "mixin" class, and we don't have good tools for this kind of class in R yet. Explicitly inheriting from the base class might solve some issues.

In any case, converting df inputs with as.data.frame() is a good thing because many subclasses of data.frame do not behave like base data frames (tibble, data.table, sf, ...).

TDJorgensen commented 4 years ago

What the hell.

Sorry for the pithy comment :-) I do appreciate Wickham's (group's) outstanding work creating RStudio, providing educational materials, and contributing R packages that make so many complex tasks easier for novice R users as well as experienced programmers.

a haven_labelled vector that should have been converted into a factor

Exactly what can be frustrating about the tidyverse development. Is it really not possible for some of the obvious cases (e.g., tbl, labelled) to inherit from existing classes? It would save programmers outside the tidyverse a lot of time tracking down why weird things happen with superficially unfamiliar object classes.

larmarange commented 4 years ago

A tibble does inherit from the data.frame class, which is not the case for haven_labelled.

It could be relevant to update that specific class.

larmarange commented 4 years ago

Anyway, before modelling or plotting, the user should always specify if a variable should be treated as categorical or as continuous. The fact that there are labels in the imported data doesn't always imply that these variables should be considered as factors.
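
To illustrate the two choices described above, here is a minimal base R sketch. It assumes a simulated vector that only mimics the class and "labels" attribute haven attaches on import (haven itself is not loaded), and it is not how any package implements the conversion; the names item, item_numeric and item_factor are made up for the example.

```r
# Simulated stand-in for an imported SPSS item (assumption: mimics the class
# and "labels" attribute that haven attaches; haven itself is not loaded).
item <- structure(
  c(1, 2, 3, 2),
  labels = c(Disagree = 1, Neutral = 2, Agree = 3),
  class = c("haven_labelled", "vctrs_vctr", "double")
)

# Choice 1: treat the item as continuous.
# unclass() drops the class first, as.numeric() then drops the labels.
item_numeric <- as.numeric(unclass(item))

# Choice 2: treat the item as categorical, using the value labels as levels.
value_labels <- attr(item, "labels")
item_factor <- factor(
  as.numeric(unclass(item)),
  levels = unname(value_labels),
  labels = names(value_labels)
)
```

Either result is an ordinary numeric vector or factor, so the modelling function no longer has to guess what the labels mean.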

yrosseel commented 4 years ago

Just to be clear: apart from coercing 'data' to a data.frame (as is now the case in the dev version), is there anything more that lavaan should do?

lionel- commented 4 years ago

The labelled class is considered to be a temporary class, and the onus is on the user to convert labelled classes to factors before getting into data analysis. In any case, lavaan should not do anything. I am not sure how converting to data frame fixes the issue with inputs containing haven-labelled columns.

Regarding: https://github.com/yrosseel/lavaan/blob/290cc70ee05ccf7643cdc6d40e4c4fcad3ec8f53/R/xxx_lavaan.R#L64-L67

I recommend doing this for any class that inherits from data frame. For instance, data.table does data-masking, among other special interpretations of subsetting. This could introduce scoping issues in your code. As another example, sf data frames maintain a sticky geometry column, which means that the usual invariant length(sf[1:n]) == n does not hold.

If you convert inputs to a data frame, there is no need to hope for the best because you control exactly the interface you're using.
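
A small base R sketch of both points made above: as.data.frame() strips extra classes from the container, but it leaves column classes alone. The class name "my_tbl" is hypothetical (standing in for tibble, data.table, sf, ...) and the labelled column is simulated without loading haven.

```r
# A data frame with an extra class in front of "data.frame"
# (hypothetical class name "my_tbl").
df <- data.frame(score = c(10, 20, 30))
df$item <- structure(
  c(1, 2, 3),
  labels = c(Low = 1, High = 3),
  # simulated class vector; haven is not loaded here
  class = c("haven_labelled", "vctrs_vctr", "double")
)
class(df) <- c("my_tbl", "data.frame")

# Coerce to a base data frame, as discussed for lavaan's input handling.
plain <- as.data.frame(df)

# The container is now a plain data frame ...
container_is_plain <- identical(class(plain), "data.frame")
# ... but the column still carries the haven_labelled class.
column_still_labelled <- inherits(plain$item, "haven_labelled")
```

Under these assumptions both flags are TRUE, which is why coercing the container alone would not be expected to remove the "unknown type" error for labelled columns.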

larmarange commented 4 years ago

It is unclear whether data should be coerced to a data.frame. Is there really a problem with tibbles, or is there a problem only with haven_labelled vectors?

What could be done is to check if any of the variables passed to lavaan is of class haven_labelled and then to display an error message: "Please convert any 'haven_labelled' variables into factor or into numeric."

It could be easily done by users, using:

to_factor() can be applied to an overall data.frame to transform all labelled vectors into factors, and the strict = TRUE argument can be used to convert into factors only when all values have a label.
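
The check suggested above could be sketched without any extra dependency roughly as follows. The helper name check_no_labelled is hypothetical and not part of lavaan, and the labelled column is again simulated in base R rather than created with haven.

```r
# Hypothetical helper (not part of lavaan): stop early if any column still
# carries the "haven_labelled" class, and name the offending columns.
check_no_labelled <- function(data) {
  is_labelled <- vapply(data, inherits, logical(1), what = "haven_labelled")
  if (any(is_labelled)) {
    stop(
      "Please convert any 'haven_labelled' variables into factor or into numeric: ",
      paste(names(data)[is_labelled], collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# Example input with one simulated labelled column (haven not loaded).
df <- data.frame(score = c(10, 20, 30))
df$item <- structure(c(1, 2, 3), labels = c(Low = 1, High = 3),
                     class = c("haven_labelled", "vctrs_vctr", "double"))

# Capture the error message instead of aborting the script.
msg <- tryCatch(check_no_labelled(df), error = conditionMessage)
```

Using inherits() keeps the check purely class-based, so it needs neither haven nor labelled to be installed.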

larmarange commented 4 years ago

Typically, if you want to convert only labelled vectors into factors (if all values have a label) or into numeric (if some values do not have a label), you can easily use:

labelled::to_factor(df, labelled_only = TRUE, strict = TRUE, unclass = TRUE)

yrosseel commented 4 years ago

@lionel: I agree that we should convert all objects that inherit from data.frame to data.frame (not just tibbles). Changed this. @larmarange: I want to avoid any additional package dependencies, but I would like to do more testing: could you send me a small dataset that contains haven_labelled variables?

larmarange commented 4 years ago

You can use data(fertility, package = "questionr"), a set of 3 data frames with value labels.

Value labels are a way to provide metadata to a variable without assuming the type of variables (numeric or categorical). Often, they are used for categorical variables but it is not systematic and value labels can be used just to add metadata to a specific value of a continuous variable. Value labels allow preserving original coding when importing datasets from Stata, SAS or SPSS or for documenting survey datasets that will be re-exported at the end of a process of data transformation and recoding. Be aware that value labels could be added to a numeric or to a character vector.

However, value labels are not intended to be used for plotting or for modelling. For modelling, users have to convert haven_labelled vectors into factors or into numerical/character variables. It is the case for any kind of model with all packages.

I agree that lavaan should not add additional dependencies to handle a type of vector that should have been converted by the user before analysis. Therefore, if you want to do things explicitly for users, I suggest just adding some data checking, with two possibilities:

I do not think that lavaan should do more. haven_labelled vectors are not supposed to be used in a model because we do not know if they should be treated as categorical or as continuous.

jsraadt commented 4 years ago

users have to convert haven_labelled vectors into factors or into numerical/character variables

To make the RStudio point-and-click import from SPSS use haven, and then say it is the user's responsibility to convert the variable, is just absurd. This is like a developer expecting users to be developers. It's a major reason why R has an adoption problem (-1% since 2018).

lionel- commented 4 years ago

Sorry to break it to you but there are technical problems that are hard to solve in sufficient generality. The current situation is a reasonable compromise for importing a foreign type in the R environment.

R programmer-users are expected to read error messages (here the message is very clear: haven_labelled is an unknown type for lavaan) and documentation (https://haven.tidyverse.org/reference/labelled.html would quickly come up in a Google search and explains how to convert to factor or numeric depending on your use case).

In any case, this is all terribly off-topic for lavaan. I don't think we should be discussing design choices in haven here.