dplyr "compatibility" (was: Results sensitive to data ordering)

ssoyounglee commented 1 year ago

Hello,

I've noticed that the fixed effects results from plm are sensitive to how the data is ordered.

The fixed-effects regression provides the expected results when the data is sorted by unit-time (unit first and then time), but provides different (wrong) coefficients when the data is sorted by time-unit (time first then unit).

I think it would make a great package even better if the results were not sensitive to how the data is sorted, or if at the very least there is an error message that pops out alerting researchers that their data is not sorted the right way if the data is not pre-arranged as unit-year.

Thanks!

zeileis commented 1 year ago

Thanks for raising this issue and also for putting together an illustration: https://yiqingxu.org/public/plm.html (My understanding is that this is your work.)

TL;DR The problem is due to using dplyr::arrange which does not work correctly for pdata.frame objects.

I took a quick look and it seems that the problem is not due to the different sorting! It's due to using arrange() which somehow sorts the data without using [ subsetting. Due to this the main data.frame and the index attribute are out of sync and produce incorrect results.

Adapting your example, I create three pdata.frames: The original sorted by state before year and two flavors sorted by year before state, one via standard subset [ and one via dplyr::arrange.

library("plm")
data("Produc", package = "plm")
p_orig <- pdata.frame(Produc, index = c("state", "year"))
p_subset <- p_orig[order(p_orig$year), ]
p_dplyr <- dplyr::arrange(p_orig, year, state)

The first two yield the same results compared to using a data.frame plus index:

coef(plm(pcap ~ gsp, data = Produc, index = c("state", "year"), model = "within"))
##       gsp 
## 0.1450668 
coef(plm(pcap ~ gsp, data = p_orig, model = "within"))
##       gsp 
## 0.1450668 
coef(plm(pcap ~ gsp, data = p_subset, model = "within"))
##       gsp 
## 0.1450668

But the dplyr version produces the erroneous results:

coef(plm(pcap ~ gsp, data = p_dplyr, model = "within"))
##       gsp 
## 0.3868028

The reason is that the dplyr version has the re-arranged data (equal to that in p_subset) but incorrectly preserved the original index (equal to that in p_orig):

all.equal(p_dplyr, p_subset)
## [1] "Attributes: < Component \"index\": Attributes: < Component \"row.names\": 814 string mismatches > >"
## [2] "Attributes: < Component \"index\": Component \"state\": 784 string mismatches >"                    
## [3] "Attributes: < Component \"index\": Component \"year\": 768 string mismatches >"                     
all.equal(index(p_dplyr), index(p_orig))
## [1] TRUE

Not sure what the best way forward is here. It's not clear to me which tools/methods arrange uses internally and whether plm could just provide suitable methods for these underlying tools. Alternatively, plm could provide methods for arrange and possibly other dplyr functions, possibly simply coercing to data.frame and dropping the index before calling the data.frame method.

tappek commented 1 year ago

Big thank you to the both of you, for spotting and taking the time for reporting and analysing!

In general, dplyr does not seem very friendly to non-base data structures (it simply cannot know all the things to adjust for these). My general recommendation would be to do all dplyr excercises on the original data (plain data frame) and convert to a pdata.frame only when no more dplyr-operations are to be done. I recently worked around non-index subsetting of dplyr::filter (currently only in dev version of plm) where it was possible to detect an ill-conditioned pdata.frame created by dplyr::filter (https://stackoverflow.com/questions/76020726/why-does-plm-not-like-my-dplyr-created-dataframe) and warn the user ("input data claims to be a pdata.frame but does not seem to have compliant properties, results can be unreliable. This can happen due to data manipulation by non-pdata.frame-aware functions (e.g., 'dplyr' on pdata.frame). Maybe re-create data input as fresh pdata.frame after last data manipulation with other tools."), shall you be interested the test file for this is this one: https://github.com/ycroissant/plm/blob/main/inst/tests/test_pdata.frame_compliant.R

However, for the case reported now it is not possible to tell ex post that the index does not match the data in the pdata.frame created by dplyr::arrange (as the variables used to create the index are not guaranteed to stay in the data (pdata.frame(<.>, drop.index = TRUE) and rownames could be changed by argument row.names).

Always creating a fresh pdata.frame inside plm() (and other estimation functions) would impose a performance panelty if multiple plm() calls for the same data are executed.

An arrange-method for pdata.frames in plm package could do the trick:

 arrange.pdata.frame <- function(.data, ..., .by_group = FALSE, .locale = NULL) {
  idx <- index(.data)
  ag.df  <- dplyr:::arrange.data.frame(.data, ..., .by_group = FALSE, .locale = NULL)
  ag.idx <- dplyr:::arrange.data.frame(idx,   ..., .by_group = FALSE, .locale = NULL)
  attr(ag.df, "index") <- ag.idx
  ag.df
}

p_dplyr2 <- dplyr::arrange(p_orig, year, state)
coef(plm(pcap ~ gsp, data = p_dplyr2, model = "within"))
##  gsp 
## 0.1450668

zeileis commented 1 year ago

Thanks, Kevin, this makes sense. If there were suitable methods to the dplyr generics for pdata.frame that could help end-users, I think. And if it's not possible to provide a method that works correctly, it would still be helpful to just stop with an informative error.

tappek commented 1 year ago

I forgot to put a message here: I commited something May 27 to dev version, to handle this but I will need to look more closely if this would be accepted by CRAN as this involves not adding a dependency to dyplr resulting in a NOTE by R CMD check:

* checking R code for possible problems ... [44s] NOTE
arrange.pdata.frame: no visible global function definition for
  'arrange'
Undefined global functions or variables:
  arrange

ssoyounglee commented 1 year ago

Thanks Kevin, for this update! Appreciate it!

Best, Soyoung

On Thu, Aug 3, 2023 at 7:22 AM Kevin Tappe @.***> wrote:

I forgot to put a message her: I commited something May 27 to dev version, to handle this but I will need to look more closely if this would be accepted by CRAN as this involves not adding a dependency to dyplr resulting a NOTE by R CMD check:

checking R code for possible problems ... [44s] NOTE arrange.pdata.frame: no visible global function definition for 'arrange' Undefined global functions or variables: arrange

— Reply to this email directly, view it on GitHub https://github.com/ycroissant/plm/issues/46#issuecomment-1663806584, or unsubscribe https://github.com/notifications/unsubscribe-auth/AXMX5LAJLWAWKNHP7C4CUT3XTOCYLANCNFSM6AAAAAAYMZQNYM . You are receiving this because you authored the thread.Message ID: @.***>

zeileis commented 1 year ago

Kevin, thanks. I think you need to include dplyr in Suggests for this, e.g., like this: https://github.com/ycroissant/plm/pull/48

tappek commented 1 year ago

Thank you, Achim, for taking the time to look into this!

I am aware of the Suggests-dependency solution, but should have made that more clear in my last post. See: https://github.com/ycroissant/plm/blob/8dfd6b0350761f870ee23a08ad185276d35c3ca6/R/tool_pdata.frame.R#L1547

Adding a dependency is what I want to avoid. Given you did not come up with another solution, I will assume there is no such solution in the R world ;). And I acknowledge, "silently" (in the sense of non-dependency adding) extending another package is hack-ish and, thus, unsupported by R.

zeileis commented 1 year ago

Ah, I missed that you were aware that a "Suggests" dependency is sufficient. Personally, I would think that this would be acceptable.

Possibly, you can get around the dependency by adding a utils::globalVariables("arrange") in the code. I didn't try that. In any case, it would be clear that this is not "as intended" and it's not unlikely that future versions of R CMD check detect this and complain about it.

tappek commented 1 year ago

Indeed, utils::globalVariables("arrange") does the trick, so included. I was not aware of this function - thank you!

tappek commented 1 year ago

I will leave this issue open as a reminder to look into more dplyr verbs possibly "conflicting"/giving unexpected results with pdata.frame.

zeileis commented 1 year ago

Thanks, Kevin. Just for the record: I'm fairly sure that the solution is not "clean" and it will lead to further discussions at some point. But maybe that's sufficiently far in the future 🤞

ycroissant / plm

dplyr "compatibility" (was: Results sensitive to data ordering) #46