Handling of missing data in method "fd" / time-wise first-differences

jkhanson1970 commented 2 years ago

Greetings,

It appears that method "fd" will calculate a first difference even when data points for previous time periods are missing. I noticed this when trying to get R and Stata to produce the same results for a model with first differences, in which plm with the fd method reported having more cases in the estimation.

Specifically, if data points in the middle of a panel series are missing, method fd still constructs a first difference by going back to the most recent period for which there is data, rather than dropping the observation.

I illustrate in the attached code using the EmplUK dataset. I set specific data points to missing. I then construct the first differences manually. Then, I compare fd using the source data in levels to method random using the first differenced data.

Kind regards,

Jon Hanson

library(plm)

data("EmplUK" , package = "plm")
Em <- pdata.frame(EmplUK, index = c("firm", "year"))

# set selected data points to missing (NOT in the first year)
Em$capital[Em$firm == 1 & Em$year == 1981] <- NA
Em$capital[Em$firm == 2 & Em$year == 1981] <- NA
Em$capital[Em$firm == 2 & Em$year == 1982] <- NA

# make first differences directly
Em$d.emp <- diff(log(Em$emp))
Em$d.wage <- diff(log(Em$wage))
Em$d.capital <- diff(log(Em$capital))

# estimate model with "fd"
mod.fd <- plm(log(emp) ~ log(wage) + log(capital), data = Em, model = "fd")
summary(mod.fd)
length(mod.fd$residuals)

# estimate model with first differences; edited based on response from tappek
mod.alt <- plm(d.emp ~ d.wage + d.capital, data = Em, model = "pooling")
summary(mod.alt)
length(mod.alt$residuals)

tappek commented 2 years ago

Yes, you are right. This is (poorly) documented in the NEWS for version 1.7-0, when the lag, lead, and diff functions for panel data were enhanced (and defaulted) to allow time-wise shifting:

Note that, however, the diff-ing performed in first-difference model estimation by plm(..., model = "fd") is based on row-wise differences of the model matrix per individual.
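To make the distinction concrete, here is a minimal base-R sketch of row-wise versus time-wise differencing on a toy series. It does not use plm's internals; the helper names row_diff and time_diff and the toy values are made up for illustration.

```r
# Toy panel: one individual observed 1977-1983, with the 1981 row missing
# (as it would be after rows with NA are dropped from the model frame).
panel <- data.frame(
  id   = rep(1L, 6),
  year = c(1977L, 1978L, 1979L, 1980L, 1982L, 1983L),
  x    = c(10, 12, 15, 19, 30, 37)
)

# Row-wise: difference to the previous available row of the same individual.
# (hypothetical helper, not part of plm)
row_diff <- function(x, id) {
  ave(x, id, FUN = function(v) c(NA, diff(v)))
}

# Time-wise: keep the difference only if the previous row is exactly
# one period earlier. (hypothetical helper, not part of plm)
time_diff <- function(x, id, time) {
  d   <- row_diff(x, id)
  gap <- ave(time, id, FUN = function(t) c(NA, diff(t)))
  d[is.na(gap) | gap != 1] <- NA
  d
}

panel$d_row  <- row_diff(panel$x, panel$id)
panel$d_time <- time_diff(panel$x, panel$id, panel$year)
print(panel)
```

In the 1982 row, d_row holds the two-period change from 1980 to 1982, while d_time is NA, so the row-wise variant ends up with one more usable observation.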

Up until now, no one has gotten around to implementing this for plm's FD estimation. A workaround is to diff the data yourself before estimation. This seems to be what you intend with your example, but note that you put model = "random" where you want model = "pooling", i.e., set mod.alt <- plm(d.emp ~ d.wage + d.capital, data = Em, model = "pooling").

jkhanson1970 commented 2 years ago

Thank you for the quick reply. I appreciate the clarification, as well as the correction to use "pooling" for the alternative model.

I am not sure, but I think that the row-wise differencing in the FD routine has downstream effects. For example, pwfdtest also gives me different estimation sample sizes than the counterpart in Stata when using the same data. That's what first alerted me to the situation.

tappek commented 2 years ago

Yes, this is correct as well (as these tests internally use plm(., model = "fd")).

tappek commented 2 years ago

Steps to get there:

[x] enable internal function pdiff (in base R) for time-wise shifting
[x] enable internal function pdiff (fast collapse version) for time-wise shifting
[x] implement wrapper for switch between pdiff's row- and time-wise shifting
[ ] adjust all first-diff index adjustments to loss of periods in time-wise working mode (there is a non-exported stub make.fdindex as a helper function)
- [ ] vcovG,
- [ ] pwartest,
- [ ] pwfdtest,
- [ ] pggls,
- [ ] (others?)
[ ] switch default FD model estimation to time-wise (in model.matrix.pdata.frame and ptransform)
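As a rough illustration of the index adjustment in the open items above, the following base-R sketch computes which (id, time) pairs remain after first-differencing in either mode. The function fd_index is hypothetical and is not plm's make.fdindex stub, whose interface is not shown here. The toy index assumes, for illustration only, that two firms are otherwise observed 1977-1983.

```r
# Hypothetical helper (NOT plm's make.fdindex): returns the (id, time) index
# of the observations that remain after first-differencing.
fd_index <- function(id, time, shift = c("time", "row")) {
  shift <- match.arg(shift)
  ord  <- order(id, time)
  id   <- id[ord]
  time <- time[ord]
  n <- length(id)
  # TRUE where the previous row belongs to the same individual
  same_id <- c(FALSE, id[-1] == id[-n])
  # TRUE where the previous row is exactly one period earlier
  consecutive <- c(FALSE, diff(time) == 1)
  keep <- if (shift == "row") same_id else same_id & consecutive
  data.frame(id = id[keep], time = time[keep])
}

# Rows left after dropping NAs (assumed toy index):
# firm 1 lacks 1981, firm 2 lacks 1981 and 1982.
id   <- c(rep(1L, 6), rep(2L, 5))
time <- c(1977:1980, 1982:1983, 1977:1980, 1983L)

idx_row  <- fd_index(id, time, shift = "row")   # differences across gaps kept
idx_time <- fd_index(id, time, shift = "time")  # differences across gaps dropped
print(nrow(idx_row))
print(nrow(idx_time))
```

Any routine that reuses the index of the differenced data (such as the functions listed above) would need the shorter time-wise index rather than "all rows but the first per individual".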

ycroissant / plm

Handling of missing data in method "fd" / time-wise first-differences #27