Residuals are automatically sorted according to the values of the group variable

ghost commented 2 years ago

I am using the State Traffic Fatality Data Set from Chapter 10 of Stock and Watson's Introduction to Econometrics (3rd edition), which can be found here. In my dataset, the order of the group variable state is not sorted:

> Fatalities2$state
  [1] al al al al al al al az az az az az az az ar ar ar ar ar ar ar ca ca ca ca ca ca ca co co co co co co co ct ct ct ct ct ct ct de de de de de de de fl fl fl fl fl fl fl ga
 [58] ga ga ga ga ga ga id id id id id id id il il il il il il il in in in in in in in ia ia ia ia ia ia ia ks ks ks ks ks ks ks ky ky ky ky ky ky ky la la la la la la la me me
[115] me me me me me md md md md md md md ma ma ma ma ma ma ma mi mi mi mi mi mi mi mn mn mn mn mn mn mn ms ms ms ms ms ms ms mo mo mo mo mo mo mo mt mt mt mt mt mt mt ne ne ne
[172] ne ne ne ne nv nv nv nv nv nv nv nh nh nh nh nh nh nh nj nj nj nj nj nj nj nm nm nm nm nm nm nm ny ny ny ny ny ny ny nc nc nc nc nc nc nc nd nd nd nd nd nd nd oh oh oh oh
[229] oh oh oh ok ok ok ok ok ok ok or or or or or or or pa pa pa pa pa pa pa ri ri ri ri ri ri ri sc sc sc sc sc sc sc sd sd sd sd sd sd sd tn tn tn tn tn tn tn tx tx tx tx tx
[286] tx tx ut ut ut ut ut ut ut vt vt vt vt vt vt vt va va va va va va va wa wa wa wa wa wa wa wv wv wv wv wv wv wv wi wi wi wi wi wi wi wy wy wy wy wy wy wy
Levels: al ar az ca co ct de fl ga ia id il in ks ky la ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa ri sc sd tn tx ut va vt wa wi wv wy

For example, az is in front of ar, which is not following the alphabetical order. It is more obvious if we check the numeric values of state

> as.numeric(Fatalities2$state)
  [1]  1  1  1  1  1  1  1  3  3  3  3  3  3  3  2  2  2  2  2  2  2  4  4  4  4  4  4  4  5  5  5  5  5  5  5  6  6  6  6  6  6  6  7  7  7  7  7  7  7  8  8  8  8  8  8  8  9
 [58]  9  9  9  9  9  9 11 11 11 11 11 11 11 12 12 12 12 12 12 12 13 13 13 13 13 13 13 10 10 10 10 10 10 10 14 14 14 14 14 14 14 15 15 15 15 15 15 15 16 16 16 16 16 16 16 19 19
[115] 19 19 19 19 19 18 18 18 18 18 18 18 17 17 17 17 17 17 17 20 20 20 20 20 20 20 21 21 21 21 21 21 21 23 23 23 23 23 23 23 22 22 22 22 22 22 22 24 24 24 24 24 24 24 27 27 27
[172] 27 27 27 27 31 31 31 31 31 31 31 28 28 28 28 28 28 28 29 29 29 29 29 29 29 30 30 30 30 30 30 30 32 32 32 32 32 32 32 25 25 25 25 25 25 25 26 26 26 26 26 26 26 33 33 33 33
[229] 33 33 33 34 34 34 34 34 34 34 35 35 35 35 35 35 35 36 36 36 36 36 36 36 37 37 37 37 37 37 37 38 38 38 38 38 38 38 39 39 39 39 39 39 39 40 40 40 40 40 40 40 41 41 41 41 41
[286] 41 41 42 42 42 42 42 42 42 44 44 44 44 44 44 44 43 43 43 43 43 43 43 45 45 45 45 45 45 45 47 47 47 47 47 47 47 46 46 46 46 46 46 46 48 48 48 48 48 48 48

plm() gives the correct regression results

> my_model_panel2 <- plm(mrall ~ beertax, 
+                       data = Fatalities2,
+                       index = c("state", "year"), 
+                       model = "within")
> summary(my_model_panel2)
Oneway (individual) effect Within Model

Call:
plm(formula = mrall ~ beertax, data = Fatalities2, model = "within", 
    index = c("state", "year"))

Balanced Panel: n = 48, T = 7, N = 336

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.5869619 -0.0828376 -0.0012701  0.0795454  0.8977960 

Coefficients:
        Estimate Std. Error t-value Pr(>|t|)    
beertax -0.65587    0.18785 -3.4915 0.000556 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    10.785
Residual Sum of Squares: 10.345
R-Squared:      0.040745
Adj. R-Squared: -0.11969
F-statistic: 12.1904 on 1 and 287 DF, p-value: 0.00055597

However, if we calculate the fitted values using

> Fatalities2$mrall_hat <- Fatalities2$mrall - my_model_panel2$residuals

and plot the regression lines using

> library(ggplot2)
> ggplot(Fatalities2) +
+   geom_point(aes(y = mrall, x = beertax, color = state), size = 2, show.legend = FALSE) + 
+   geom_line(aes(y = mrall_hat, x = beertax, color = state), show.legend = FALSE)

we get Clearly, the fitted values are not correct. The reason is my_model_panel2$residuals is sorted according to the values of state. For example, the residuals for state="az" is put in my_model_panel2$residuals[15:21], but it suppose to be in my_model_panel2$residuals[8:14] to match the order with the other variables.

If we sorted it back according to the order of state using

> myorder <- seq(1,nrow(Fatalities2))[order(Fatalities2$state)]
> residuals_reordered <- my_model_panel2$residuals[order(myorder)]

and then calculate the new fitted values and plot them, we have

tappek commented 2 years ago

Internally, plm (and other functions) use a data structure called pdata.frame which is sorted by individual (first sort criterion) and time dimension (second sort criterion). It is recommended to create a pdata.frame of your data beforehand (it is sorted) and then use that. The sorting will be the same then.

Something along these lines:

pFatalities <- pdata.frame(Fatalities, index = c("state", "year"))
mod <- plm(mrall ~ beertax, data = pFatalities, index = c("state", "year"), model = "within")

However, note that for arbitrary data the residuals might not fit to the original data as model estimation drops observations with NA and NaN values. To match the data correctly, you would need to merge them by individual and time index. Or use the data as used in estimation in mod$model.

ghost commented 2 years ago

Many thanks!

ycroissant / plm

Residuals are automatically sorted according to the values of the group variable #16